Get value from a specific map key in Cassandra

For example, I have a map under the column 'users' in a table called 'table' with primary key 'Id'.
If the map looks like this, {'Phone': '1234567899', 'City': 'Dublin'}, I want to get the value for the key 'Phone' for a specific 'Id' in a Cassandra database.

Yes, that's possible to do with CQL when using a MAP collection.
To test this, I created a simple table using the specifications and data you mentioned above:
> CREATE TABLE stackoverflow.usermap (
id text PRIMARY KEY,
users map<text, text>);
> INSERT INTO usermap (id,users)
VALUES ('1a',{'Phone': '1234567899','City': 'Dublin'});
> SELECT * FROM usermap WHERE id='1a';
id | users
----+-------------------------------------------
1a | {'City': 'Dublin', 'Phone': '1234567899'}
(1 rows)
Then, I queried with the same WHERE clause, but altering my SELECT to pull back the user's phone only:
> SELECT users['Phone'] FROM usermap WHERE id='1a';
users['Phone']
----------------
1234567899
(1 rows)
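Worth noting: individual map entries can be updated and deleted the same way. A minimal sketch against the same usermap table (the new phone number is just an illustrative value):
> UPDATE usermap SET users['Phone'] = '5550001234' WHERE id='1a';
> DELETE users['Phone'] FROM usermap WHERE id='1a';
The UPDATE upserts a single key-value pair without touching the rest of the map, and the DELETE removes only that key.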

How to update and replace in Cassandra a UDT field value?

Does Cassandra support updating a UDT field value, something like replacing it with a new value?
I have user_fav_payment_method UDT and I need to replace cash with debit card:
update user_ratings set
user_fav_payment_method{'cash'} = {'debit cards'}
where rating_id = 66;
This code is wrong, but I need to do something similar. How can I do it?
Per documentation:
In Cassandra 3.6 and later, user-defined types that include only non-collection fields can update individual field values. Update an individual field in user-defined type data using the UPDATE command. The desired key-value pair are defined in the command. In order to update, the UDT must be defined in the CREATE TABLE command as an unfrozen data type.
You can use . notation to update only individual fields of the non-frozen UDT, like this:
cqlsh> use test;
cqlsh:test> create type payment_method ( method text, data text);
cqlsh:test> create table users (id int primary key, pay_method payment_method);
cqlsh:test> insert into users (id, pay_method) values (1, {method: 'cash', data: 'usd'});
cqlsh:test> select * from users;
id | pay_method
----+-------------------------------
1 | {method: 'cash', data: 'usd'}
(1 rows)
cqlsh:test> update users set pay_method.method = 'card' where id = 1;
cqlsh:test> select * from users;
id | pay_method
----+-------------------------------
1 | {method: 'card', data: 'usd'}
(1 rows)
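For contrast: had pay_method been declared as frozen<payment_method>, the field-level update above would be rejected, and the whole value would have to be replaced in one statement. A minimal sketch, using a hypothetical users_frozen table:
cqlsh:test> create table users_frozen (id int primary key, pay_method frozen<payment_method>);
cqlsh:test> update users_frozen set pay_method = {method: 'card', data: 'usd'} where id = 1;
Note that the replacement literal must supply every field; any field left out becomes null.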

cassandra data consistency issue

Hi, I'm new to Apache Cassandra and I found an article about the Basic Rules of Cassandra Data Modeling. In example 1, two tables are created:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
)
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
)
These tables contain the same data (username, email and age). What I don't understand is how to insert data into the two tables. I think I have to execute two separate inserts: one for users_by_username and one for users_by_email. But how do I maintain data consistency between the tables? For example, what if I insert data into the first table and forget to insert it into the second ... or the other way around?
It's your job as a developer to make sure that the data is in sync. That said, you can use things like materialized views to generate another "table" with a slightly different primary key (there are some rules on what can be changed). For your case, for example, you could have the following:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int);
CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users_by_username
WHERE email IS NOT NULL AND username IS NOT NULL
PRIMARY KEY (email, username);
and if you insert data as
insert into users_by_username (username, email, age)
values ('test', 'test@domain.com', 30);
you can query the materialized view for data, in addition to querying by username:
SELECT * from users_by_username where username = 'test';
username | age | email
----------+-----+-----------------
test | 30 | test@domain.com
SELECT * from users_by_email where email = 'test@domain.com';
email | username | age
-----------------+----------+-----
test@domain.com | test | 30
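If you would rather maintain the two plain tables yourself instead of using a materialized view, a logged batch is the usual way to keep the writes consistent: Cassandra guarantees that either all statements in the batch eventually apply or none do. A sketch against the two tables from the question:
BEGIN BATCH
INSERT INTO users_by_username (username, email, age) VALUES ('test', 'test@domain.com', 30);
INSERT INTO users_by_email (email, username, age) VALUES ('test@domain.com', 'test', 30);
APPLY BATCH;
Logged batches pay a coordination cost (a batchlog write), so use them for correctness across tables, not for bulk-load performance.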

How to only return some map keys (aka, slice a range of map/set elements) in CQL 3?

I'm trying to do my own CF reverse index in Cassandra right now, for a geohash lookup implementation.
In CQL 2, I could do this:
CREATE COLUMNFAMILY song_tags (id uuid PRIMARY KEY) WITH comparator=text;
insert into song_tags ('id', 'blues', '1973') values ('a3e64f8f-bd44-4f28-b8d9-6938726e34d4', '', '');
insert into song_tags ('id', 'covers', '2007') values ('8a172618-b121-4136-bb10-f665cfc469eb', '', '');
SELECT * FROM song_tags;
Which resulted in:
id,8a172618-b121-4136-bb10-f665cfc469eb | 2007, | covers,
id,a3e64f8f-bd44-4f28-b8d9-6938726e34d4 | 1973, | blues,
And allowed to return 'covers' and 'blues' via:
SELECT 'a'..'f' FROM song_tags
Now, I'm trying to use CQL 3, which has gotten rid of dynamic columns and suggests using a set or map column type instead. Sets and maps have their values/keys ordered alphabetically, and under the hood (IIRC) are columns; hence, they should support the same type of range slicing... but how?
I suggest forgetting what you know about 'under the hood' implementation details and focusing on what the query language lets you do.
The longer reason is that in CQL3, multiple rows map to a single column family, even though the query language presents them as different rows. It's just a different way of querying the same data.
Range slicing does not exist; the query language is flexible enough to support its use cases.
To do what you want, create an index on the genres so they are queryable without using the primary key, and then select the genre value itself.
The 'gotcha' is that some operations, like DISTINCT, can only be performed on partition keys; you will have to deduplicate client-side in that case.
For example:
CREATE TABLE song_tags (
id uuid PRIMARY KEY,
year text,
genre list<text>
);
CREATE INDEX ON song_tags(genre);
INSERT INTO song_tags (id, year, genre)
VALUES(8a172618-b121-4136-bb10-f665cfc469eb, '2007', ['covers']);
INSERT INTO song_tags (id, year, genre)
VALUES(a3e64f8f-bd44-4f28-b8d9-6938726e34d4, '1973', ['blues']);
Can then query as:
SELECT genre from song_tags;
genre
------------
['blues']
['covers']
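With the secondary index in place, you can also filter on list membership directly; assuming Cassandra 2.1 or later (where indexes on collections and CONTAINS are supported), the query would be:
SELECT * FROM song_tags WHERE genre CONTAINS 'blues';
This returns only the rows whose genre list includes 'blues'.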

Query results not ordered despite WITH CLUSTERING ORDER BY

I am storing posts from all users in a table. I want to retrieve posts from all the users a given user is following.
CREATE TABLE posts (
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (userid, time)
) WITH CLUSTERING ORDER BY (time DESC)
I have the data about who each user follows in another table:
CREATE TABLE follow (
userid int,
who_follow_me set<int>,
who_i_follow set<int>,
PRIMARY KEY ((userid))
)
I am making a query like this:
select * from posts where userid in(1,2,3,4....n);
Two questions:
Why do I still get data in random order, though CLUSTERING ORDER BY is specified on posts?
Is the model correct to satisfy the query optimally (a user can have any number of followers)?
I am using Cassandra 2.0.10.
"why I still get data in random order, though CLUSTERING ORDER BY is specified in posts?"
This is because ORDER BY only works for rows within a particular partitioning key. So in your case, if you wanted to see all of the posts for a specific user like this:
SELECT * FROM posts WHERE userid=1;
That will return your results ordered by time, as all of the rows within the userid=1 partition are clustered by it.
"Is model correct to satisfy the query optimally (user can have n number of followers)?"
It will work, as long as you don't care about getting the results ordered by timestamp. To be able to query posts for all users ordered by time, you would need to come up with a different partitioning key. Without knowing too much about your application, you could use a column like GROUP (for instance) and partition on that.
So let's say that you evenly assign all of your users to eight groups: A, B, C, D, E, F, G and H. Let's say your table design changed like this:
CREATE TABLE posts (
group text,
userid int,
time timestamp,
id uuid,
content text,
PRIMARY KEY (group, time, userid)
) WITH CLUSTERING ORDER BY (time DESC)
You could then query all posts for all users for group B like this:
SELECT * FROM posts WHERE group='B';
That would give you all of the posts for all of the users in group B, ordered by time. So basically, for your query to order the posts appropriately by time, you need to partition your post data on something other than userid.
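For completeness, a sketch of what a write into that table could look like (the group, timestamp, and content values are illustrative):
INSERT INTO posts (group, userid, time, id, content)
VALUES ('B', 1, '2015-01-25 13:24:00-0600', uuid(), 'my first post');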
EDIT:
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
That's not going to work. In fact, that should produce the following error:
code=2200 [Invalid query] message="Missing CLUSTERING ORDER for column follows"
And even if you did add follows to your CLUSTERING ORDER clause, you would see this:
code=2200 [Invalid query] message="Only clustering key columns can be defined in CLUSTERING ORDER directive"
The CLUSTERING ORDER clause can only be used on the clustering column(s), which in this case is only the follows column. Alter your PRIMARY KEY definition to cluster on follows (ASC) and created (DESC). I have tested this, inserted some sample data, and can see that this query works:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2 AND follows=1;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(3 rows)
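For reference, reconstructed from the description above, the table definition behind that output would look something like this:
CREATE TABLE posts (
id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows, created)
) WITH CLUSTERING ORDER BY (follows ASC, created DESC);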
If you query by just userid, you can see posts for all of your followers. But in that case, the posts will only be ordered within each follows value, like this:
aploetz@cqlsh:stackoverflow> SELECT * FROM posts WHERE userid=2;
userid | follows | created | id
--------+---------+--------------------------+--------------------------------------
2 | 0 | 2015-01-25 13:28:00-0600 | 94da27d0-e91f-4c1f-88f2-5a4bbc4a0096
2 | 0 | 2015-01-25 13:23:00-0600 | 798053d3-f1c4-4c1d-a79d-d0faff10a5fb
2 | 1 | 2015-01-25 13:27:00-0600 | 559cda12-8fe7-45d3-9a61-7ddd2119fcda
2 | 1 | 2015-01-25 13:26:00-0600 | 64b390ba-a323-4c71-baa8-e247a8bc9cdf
2 | 1 | 2015-01-25 13:24:00-0600 | 1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4
(5 rows)
This is my new schema:
CREATE TABLE posts(id uuid,
userid int,
follows int,
created timestamp,
PRIMARY KEY (userid, follows)) WITH CLUSTERING ORDER BY (created DESC);
Here userid represents who posted it, and follows holds the userid of one of their followers. Say user x follows 10 other people; I am making 10+1 inserts. There is definitely a lot of data duplication. However, it's now easier to get the timeline for a user with the following query:
select * from posts where follows=?

What is the reason that a composite-column table must have at least one column which is not part of the primary key?

From the online documentation:
A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
What is the reason for that?
I tried to insert a row with only the columns in the composite key in CQL; I can't see it when I do a SELECT:
cqlsh:demo> CREATE TABLE DEMO (
user_id bigint,
dep_id bigint,
created timestamp,
lastupdated timestamp,
PRIMARY KEY (user_id, dep_id)
);
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id)
... VALUES (100, 1);
cqlsh:demo> select * from demo;
cqlsh:demo>
But when I use the CLI, something shows up:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
1 Row Returned.
Elapsed time: 27 msec(s).
But I can't see the values of the columns.
After I add a column which is not in the primary key, the value shows up in CQL:
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id, created)
... VALUES (100, 1, '7943-07-23');
cqlsh:demo> select * from demo;
user_id | dep_id | created | lastupdated
---------+--------+--------------------------+-------------
100 | 1 | 7943-07-23 00:00:00+0000 | null
Result from CLI:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
invalid UTF8 bytes 0000ab7240ab7580
[default@demo]
Any idea?
Update: I found the reason why the CLI returns invalid UTF8 bytes 0000ab7240ab7580: it's not compatible with the table created from CQL3. If I use the COMPACT STORAGE option, the value shows up correctly in the CLI.
What's really happening under the covers is that the non-key values are being saved using the primary key values which make up the row key and column names. If you don't insert any non-key values then you're not really creating any new column family columns. The row key comes from the first primary key, so that's why Cassandra was able to create a new row for you, even though no columns were created with it.
This limitation is fixed in Cassandra 1.2, which is in beta now.
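As the update in the question notes, tables created via CQL3 without COMPACT STORAGE use an internal layout that the old Thrift-based CLI cannot fully decode. A sketch of the compact variant that the CLI reads correctly (demo_compact is a hypothetical name; note that with clustering columns, COMPACT STORAGE allows exactly one non-key column):
cqlsh:demo> CREATE TABLE demo_compact (
        ... user_id bigint,
        ... dep_id bigint,
        ... created timestamp,
        ... PRIMARY KEY (user_id, dep_id)
        ... ) WITH COMPACT STORAGE;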
