I just started working with the SASI index on Cassandra 3.7.0 and ran into a problem that I suspect is a bug. After some effort I tracked down the situation in which it shows up. Here is what I found:
A query that uses a SASI index may incorrectly return 0 rows, while a slightly different table definition works as expected. The following CQL reproduces it:
CREATE TABLE IF NOT EXISTS roles (
name text,
a int,
b int,
PRIMARY KEY ((name, a), b)
) WITH CLUSTERING ORDER BY (b DESC);
insert into roles (name,a,b) values ('Joe',1,1);
insert into roles (name,a,b) values ('Joe',2,2);
insert into roles (name,a,b) values ('Joe',3,3);
insert into roles (name,a,b) values ('Joe',4,4);
CREATE TABLE IF NOT EXISTS roles2 (
name text,
a int,
b int,
PRIMARY KEY ((name, a), b)
) WITH CLUSTERING ORDER BY (b ASC);
insert into roles2 (name,a,b) values ('Joe',1,1);
insert into roles2 (name,a,b) values ('Joe',2,2);
insert into roles2 (name,a,b) values ('Joe',3,3);
insert into roles2 (name,a,b) values ('Joe',4,4);
CREATE CUSTOM INDEX ON roles (b) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
CREATE CUSTOM INDEX ON roles2 (b) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
Note that the only difference between roles and roles2 is the clustering order: 'CLUSTERING ORDER BY (b DESC)' for roles versus 'CLUSTERING ORDER BY (b ASC)' for roles2.
When querying with the statement select * from roles2 where b<3;, the result is two rows:
name | a | b
------+---+---
Joe | 1 | 1
Joe | 2 | 2
(2 rows)
However, querying with select * from roles where b<3; returns no rows at all:
name | a | b
------+---+---
(0 rows)
This is not the only situation where the bug shows up. Once I created a SASI index with an explicit name, 'end_idx', on a column named 'end', and the bug appeared; when I did not specify the index name, it went away.
Please help me confirm this bug, or tell me if I am using the SASI index the wrong way.
I have a Cassandra table with data in it.
I added three new columns: country as text, and lat and long as double.
When these columns are added, null values appear for the rows already present in the table. However, null is shown as text in the country column and as a value in the lat and long columns.
Is this the default behavior, and can I have null as a value in the newly created text column?
Cassandra uses null to show that a value is missing, not that null was explicitly inserted. In your case, when you add new columns, they are only added to the table's specification stored in Cassandra itself. Existing data (stored in SSTables) is not modified, so when Cassandra reads old data it doesn't find values for those columns in the SSTable and outputs null instead.
You can get the same behavior without adding new columns: just don't insert a value for a specific regular column (columns of the primary key must always have non-null values). For example:
cqlsh> create table test.abc (id int primary key, t1 text, t2 text);
cqlsh> insert into test.abc (id, t1, t2) values (1, 't1-1', 't2-1');
cqlsh> insert into test.abc (id, t1) values (2, 't1-2');
cqlsh> insert into test.abc (id, t2) values (3, 't3-3');
cqlsh> SELECT * from test.abc;
id | t1 | t2
----+------+------
1 | t1-1 | t2-1
2 | t1-2 | null
3 | null | t3-3
(3 rows)
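The read-side behavior described in this answer can be modeled in a few lines of Python. This is only an illustrative sketch, not how Cassandra is implemented; `schema`, `stored` and `read_row` are hypothetical names introduced for the example.

```python
# Toy model: a row stores only the cells that were actually written.
# Columns that are in the schema but absent from storage come back as None,
# which is what cqlsh would display as null.

def read_row(schema, stored_cells):
    """Return one value per schema column, using None for missing cells."""
    return {column: stored_cells.get(column) for column in schema}

# Rows as they were written, before any column was added.
stored = {
    1: {"id": 1, "t1": "t1-1", "t2": "t2-1"},
    2: {"id": 2, "t1": "t1-2"},  # t2 was never written for this row
}

# Adding a column only changes the schema; the stored rows are untouched.
schema = ["id", "t1", "t2", "country"]

rows = [read_row(schema, cells) for cells in stored.values()]
for row in rows:
    print(row)
```

Both rows report `country` as missing, although nothing was ever written for it, and row 2 also reports `t2` as missing for the same reason.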
What would be the easiest way to migrate an int to a bigint in Cassandra? I thought of creating a new column of type bigint and then running a script to basically set the value of that column = the value of the int column for all rows, and then dropping the original column and renaming the new column. However, I'd like to know if someone has a better alternative, because this approach just doesn't sit quite right with me.
You could ALTER your table and change your int column to a varint type. Check the documentation about ALTER TABLE, and the data types compatibility matrix.
The only other alternative is what you said: add a new column and populate it row by row. Dropping the first column is entirely optional: if you stop assigning values to it on insert, existing data stays as it is and new records consume no space for that column.
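Before choosing a target type, it can help to check which CQL integer type the data actually needs. The sketch below relies only on the documented ranges (int is 32-bit signed, bigint is 64-bit signed, varint is arbitrary precision); `smallest_cql_integer_type` is a hypothetical helper, not part of any driver.

```python
# Ranges of the fixed-width CQL integer types.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
BIGINT_MIN, BIGINT_MAX = -2**63, 2**63 - 1

def smallest_cql_integer_type(value):
    """Name the narrowest CQL integer type that can hold value."""
    if INT_MIN <= value <= INT_MAX:
        return "int"
    if BIGINT_MIN <= value <= BIGINT_MAX:
        return "bigint"
    # Anything wider needs the arbitrary-precision type.
    return "varint"

for sample in (2**31 - 1, 2**31, 2**63):
    print(sample, smallest_cql_integer_type(sample))
```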
You can ALTER your table to use varint, which can hold any bigint value. See the example:
cassandra@cqlsh:demo> CREATE TABLE int_test (id int, name text, primary key(id));
cassandra@cqlsh:demo> SELECT * FROM int_test;
id | name
----+------
(0 rows)
cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES ( 215478936541111, 'abc');
cassandra@cqlsh:demo> SELECT * FROM int_test ;
id | name
---------------------+---------
215478936541111 | abc
(1 rows)
cassandra@cqlsh:demo> ALTER TABLE demo.int_test ALTER id TYPE varint;
cassandra@cqlsh:demo> INSERT INTO int_test (id, name) VALUES ( 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999, 'abcd');
cassandra@cqlsh:demo> SELECT * FROM int_test ;
id | name
------------------------------------------------------------------------------------------------------------------------------+---------
215478936541111 | abc
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999 | abcd
(2 rows)
cassandra@cqlsh:demo>
I am storing posts from all users in one table. I want to retrieve the posts of all users that a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
)WITH CLUSTERING ORDER BY (time DESC)
I keep the data about whom each user follows in another table:
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am running a query like:
select * from posts where userid in(1,2,3,4....n);
2 questions:
Why do I still get data in random order, even though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have n followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because clustering order only applies to rows within a particular partition key. So in your case, suppose you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That would return your results ordered by time, as all of the rows within the userid=1 partition are clustered by it.
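If a time-ordered result across several users is needed without changing the model, one possible workaround (not proposed in this answer, only a sketch) is to query each partition separately and merge the already-sorted results on the client. The data below is made up, and `merged_timeline` is a hypothetical helper.

```python
import heapq
from datetime import datetime

# Hypothetical per-partition results: each list stands for what
# "SELECT * FROM posts WHERE userid = ?" returns for one user,
# already sorted by time descending inside that partition.
posts_by_user = {
    1: [(datetime(2015, 1, 25, 13, 27), 1, "c"),
        (datetime(2015, 1, 25, 13, 24), 1, "a")],
    2: [(datetime(2015, 1, 25, 13, 28), 2, "d"),
        (datetime(2015, 1, 25, 13, 26), 2, "b")],
}

def merged_timeline(partitions):
    """Merge per-partition results into one list, newest first."""
    # heapq.merge expects every input to be sorted the same way;
    # reverse=True matches the DESC clustering order of each partition.
    return list(heapq.merge(*partitions, key=lambda post: post[0], reverse=True))

timeline = merged_timeline(posts_by_user.values())
for post in timeline:
    print(post)
```

The merge never re-sorts a partition; it only interleaves them, so it stays cheap as long as each per-user query is limited to a page of recent posts.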
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
)WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
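The answer does not say how users end up in a group. One simple possibility, assumed here purely for illustration, is to derive the group from the integer userid with a modulo; `GROUPS` and `group_for` are hypothetical names.

```python
from collections import Counter

# The eight groups named in the answer.
GROUPS = ["A", "B", "C", "D", "E", "F", "G", "H"]

def group_for(userid):
    """Map an integer userid to one of the eight groups."""
    # Modulo spreads consecutive ids evenly over the groups.
    return GROUPS[userid % len(GROUPS)]

# Check the spread for 80 consecutive user ids.
distribution = Counter(group_for(userid) for userid in range(80))
print(distribution)
```

With this assignment the group can be recomputed from the userid at write time, so it does not need to be stored anywhere else. Note that a small, fixed number of groups also means each partition keeps growing as posts accumulate.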
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case, is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, and inserted some sample data, and can see that this query works:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
If you query by just userid, you can see posts from all of your followers. But in that case, the posts will only be ordered within each follows value, like this:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema:
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it and follows represents the userid of one of their followers. Say user x follows 10 other people; I am then making 10+1 inserts. There is definitely a lot of data duplication. However, it is now easier to get the timeline for one user with the following query:
select * from posts where follows=?
I see an extra column being created in my column family when I use CQL, compared to the CLI.
Create the table using CQL and insert a row:
cqlsh:cassandraSample> CREATE TABLE bedbugs(
... id varchar,
... name varchar,
... description varchar,
... primary key(id, name)
... ) ;
cqlsh:cassandraSample> insert into bedbugs (id, name, description)
values ('Cimex','Cimex lectularius','http://en.wikipedia.org/wiki/Bed_bug');
Now insert column using cli:
[default@cassandraSample] set bedbugs['BatBedBug']['C. pipistrelli:description']='google.com';
Value inserted.
Elapsed time: 1.82 msec(s).
[default@cassandraSample] list bedbugs
... ;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: Cimex
=> (column=Cimex lectularius:, value=, timestamp=1369682957658000)
=> (column=Cimex lectularius:description, value=http://en.wikipedia.org/wiki/Bed_bug, timestamp=1369682957658000)
-------------------
RowKey: BatBedBug
=> (column=C. pipistrelli:description, value=google.com, timestamp=1369688651442000)
2 Rows Returned.
cqlsh:cassandraSample> select * from bedbugs;
id | name | description
-----------+-------------------+--------------------------------------
Cimex | Cimex lectularius | http://en.wikipedia.org/wiki/Bed_bug
BatBedBug | C. pipistrelli | google.com
So CQL creates one extra column for each row, with an empty value and no non-primary-key column name. Isn't that a waste of space?
When you create a column family through cqlsh and specify primary key(id, name), id becomes the row key and name becomes the first component of every column name in that row. In addition, CQL3 writes one empty marker column per row (the "Cimex lectularius:" entry in your listing) so that the row still exists even when all of its non-key columns are null. cassandra-cli only writes the columns you set explicitly, so the row you inserted through it has no marker column.
For compatibility with cassandra-cli and to prevent this extra column from being created, change your create table statement to include "WITH COMPACT STORAGE".
described here
So
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
);
becomes
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
) WITH COMPACT STORAGE;
WITH COMPACT STORAGE is also how you would go about supporting wide rows in cql.
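The difference between the two layouts can be mimicked with a small Python model. This is a simplified sketch of the cell names seen in the cassandra-cli listing in the question, not Cassandra's actual storage code; `storage_cells` is a hypothetical helper, and the compact layout shown assumes a table with one clustering column and exactly one non-key column.

```python
def storage_cells(clustering_value, values, compact_storage=False):
    """Return (cell name, cell value) pairs written for one CQL3 row."""
    if compact_storage:
        # Assumed compact layout: a single cell per row, named after the
        # clustering value and holding the only non-key column's value.
        (value,) = values.values()
        return [(clustering_value, value)]
    # Default layout: an empty marker cell, then one cell per non-null column,
    # each named "<clustering value>:<column name>".
    cells = [(clustering_value + ":", "")]
    cells += [
        (f"{clustering_value}:{column}", value)
        for column, value in values.items()
        if value is not None
    ]
    return cells

row = {"description": "http://en.wikipedia.org/wiki/Bed_bug"}
print(storage_cells("Cimex lectularius", row))
print(storage_cells("Cimex lectularius", row, compact_storage=True))
```

The first call yields the two cells listed under RowKey: Cimex in the question; the second yields a single cell with no marker.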
From online document:
A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
What is the reason for that?
I tried to insert a row with only the columns of the composite key in CQL. I can't see it when I do a SELECT:
cqlsh:demo> CREATE TABLE DEMO (
user_id bigint,
dep_id bigint,
created timestamp,
lastupdated timestamp,
PRIMARY KEY (user_id, dep_id)
);
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id)
... VALUES (100, 1);
cqlsh:demo> select * from demo;
cqlsh:demo>
But when I use the CLI, something shows up:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
1 Row Returned.
Elapsed time: 27 msec(s).
But I can't see the values of the columns.
After I set a column that is not part of the primary key, the row shows up in CQL:
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id, created)
... VALUES (100, 1, '7943-07-23');
cqlsh:demo> select * from demo;
user_id | dep_id | created | lastupdated
---------+--------+--------------------------+-------------
100 | 1 | 7943-07-23 00:00:00+0000 | null
Result from CLI:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
invalid UTF8 bytes 0000ab7240ab7580
[default@demo]
Any idea?
Update: I found the reason why the CLI returns invalid UTF8 bytes 0000ab7240ab7580. The CLI is not compatible with tables created from CQL3. If I use the compact storage option, the value shows up correctly in the CLI.
What's really happening under the covers is that the non-key values are being saved using the primary key values which make up the row key and column names. If you don't insert any non-key values then you're not really creating any new column family columns. The row key comes from the first primary key, so that's why Cassandra was able to create a new row for you, even though no columns were created with it.
This limitation is fixed in Cassandra 1.2, which is in beta now.