Cassandra how can I simulate a join statement - cassandra

I am new to cassandra and am coming from Postgres. I was wondering if there is a way that I can get data from 2 different tables or column family and then return the results. I have this query
select p.fullname,p.picture s.post, s.id, s.comments, s.state, s.city FROM profiles as p INNER JOIN Chats as s ON(p.id==s.profile_id) WHERE s.latitudes>=28 AND 29>= s.latitudes AND s.longitudes
">=-21 AND -23>= s.longitudes
The query has 2 tables: Profiles and Chat and they both share a common field Chats.id==Proifles.profile_id it boils down to this basically return all rows where Chat ID is equal to Profiles id. I would like to keep it that way because now updating profiles are simple and would only need to update 1 row per profile update instead of de-normalizing everything and updating thousands of records. Any help or suggestions would be great

You have to design tables in way you won't need joins. Best practice is if your table matches exactly the use case it is used for.
Cassadra has a feature called shared static columns; this allows you to bind values with partition part of primary key. Thus, you can create "joined" version of table without duplicates.
CREATE TABLE t (
p_id uuid,
p_fullname text STATIC,
p_picture text STATIC,
s_id uuid,
s_post text,
s_comments text,
s_state text,
s_city text,
PRIMARY KEY (p_id, s_id)
);

Related

How to select data in Cassandra either by ID or date?

I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primayr key, as this is my understanding problem):
CREATE TABLE documents (
uid text,
created text,
data text
}
Now my goal is to have to different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = ‘xxxx-yyyyy-zzzz’
3) Select by a date limit
SELECT * FROM documents
WHERE created >= ‘2015-06-05’
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz' Of course, uid must be unique. You might want to consider the uuid data type (instead of text)
Second one is more delicate. If you set your partition to the full date, you won't be able to do a range query, as range query is only available on the clustering column. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB,
otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid);
This works fine if within a day, you don't have too many documents (so your partition don't grow too much). And this allows you to create queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need to issue a range query across multiple months, you will need to issue multiple queries.
If your partition is too large due to the data field, you will need to remove it from documents_by_date. And use documents table to retrieve the data, given the uid you retreived from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the stratio lucene cassandra plugin, and index your date.
Question does not specify how your data is going to be with respect user and create time. But since its a document, I am assuming that one user will be creating one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) can help you get the data order by created for a given user.
For your first requirement you can query like given below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement you can query like given below
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
Use of Allow Filtering should be used diligently as it scans all partitions. If we have to create a separate table with date as primary key, it becomes tricky if there are many documents being inserted at very same second. Clustering order works best for the requirements where documents for a given user need to be sorted by time.

How is denormalization handled in cassandra

What is the best approach to update table with duplicate data?
I have a table
table users (
id text PRIMARY KEY,
email text,
description,
salary
)
I will delete, update, insert etc to this table. But I also have a requirement to be able to search by email, and description. If I create new table with new composite keys for email, and description,
when I update my base table I do
insert into users (id, salary) values (1, 500);
I do not have the required data to also update my secondary table since all the client has is id and salary. How is the second table updated.
Other workarounds and shortcomings
I could have created a materialized view, but since the base table has only one primary key I can only add one more column. my search requirement requires more than one column.
Create secondary indexes on the columns that will be searched on. But the performance for this would be bad since the columns I will be searching on would have high cardinality. i.e. description, email, etc
So, the "correct" way of doing this is to create 3 tables. salary_by_id, salary_by_email and salary_by_description.
table salary_by_id (
id text PRIMARY KEY,
salary int
)
table salary_by_email (
email text PRIMARY KEY,
salary int
)
table salary_by_description (
description text,
id int,
salary int,
primary key (description, id)
)
The reason i added id to salary_by_description is that, from my own guessing, description won't be globally uniq, so it has to have something else in it's primary key.
Depending on the size of these tables the last one might need something extra added to it's partitioning key. And if needed you can add id, email and description to the other tables.
Now, when inserting or deleting values you need so do it in all 3 tables. If you use a driver, like in java, that supports asynchronous calls, then this doesn't cost very much extra.

Data modeling easy table in Cassandra not working

I have to design a web page where a group leader can invite people to join his/her group. My requirements are really simple.
No sending duplicate emails out, if person was already contacted.
Show the group leader a list of invites sorted by invite date in ascending order.
Seems easy. I created this table.
CREATE TABLE invites (
email_address text,
invite_date timeuuid,
PRIMARY KEY (email_address, invite_date)
) WITH CLUSTERING ORDER BY (invite_date ASC);
Problem 1: LWT no use with invite_date as a Cluster column.
I figured I'd use LWT to ensure email_address is unique, only to find out IF NOT EXISTS only seems to work on the whole PRIMARY KEY, so LWT in C* does not work for me.
Problem 2: I cannot get an ordered list of invites back to save me life even with invite_date as a Cluster column.
If I take invite_date out, I cannot issue an 'order by' in CQL. That said, having invite_date out of the PK let's me use LWT...
I can't even get a 2 column table to fulfill 2 easy requirements! Any help on data modeling design for this problem is much appreciated.
New Dec. 4, 2015:
Additional to business requirements, a technical requirement I have is:
I want to make sure I model this correctly in Cassandra, so that it allows me to use CQL's LIMIT and pagingState capabilities in the Java driver. This means, I cannot just read all the rows in, sort on Java side and return the results.
Problem 1:
I think that the easiest way to handle this might be to have two separate tables, one for the emails_in_group and one for invites_by_group. This will allow each query to be fulfilled independantly. The emails_in_group table would look something like this:
CREATE TABLE emails_in_group (
email_address text,
group_id text,
PRIMARY KEY (email_address , group_id));
Then this, combined with the table as defined in Problem 2 below could be updated using a conditional batch statement as shown here:
http://docs.datastax.com/en/cql/3.1/cql/cql_using/use-batch-static.html
Problem 2:
So the basic problem here is that as you have your data currently modeled each email_address value will be in it's own partition and then within that partition the invite_date will be ordered. #bydsky is right when he said that you need to add something like a group_id to your table and make it the partition key portion of your Primary Key. If you do this and then add invite_date as a cluster column to the partition key then all records for that group_id will be stored in the same partition and the Order By will work. Order By only works within the same partiton, not across partitions which is what you were asking it to do.
CREATE TABLE invites_by_group (
group_id text,
email_address text,
invite_date timestamp,
PRIMARY KEY (group_id, invite_date));
I think you should use group_id and email as primary key.
For ordering, maybe you could do it in your application.
CREATE TABLE invites ( group_id text,
email_address text,
invite_date timestamp,
PRIMARY KEY (group_id, email_address) );
For cassandra data modeling, it's a good start to watch DS220

Schema design for top ranked post query in cassandra

I have requirement where I need to find top ranked pictures in chronological order from certain city. I came up with below schema
create table top_picture(
picture_id uuid,
city text,
rank int,
date timestamp,
primary key (city,date,rank)
) with CLUSTERING ORDER BY (date desc,rank desc);
It does solve problem to some extent (apart from duplicates) by executing following query
select * from top_picture where city='san diego';
. But if same picture_id is inserted in same day then I get duplicate entries as picture_id is not part of partition key. However I can not add it to partitioning key because then I won't be able make simple selection query like above as I would need to provide picture_id with selection query and it won't give top pics for city.
Did anyone came accross this type of schema before or any other recommended ways to do it.
It sounds like you want two views of the data. In one view you want to get the top ranked pictures and in the other view you want the picture_id to be unique.
So you could have two tables, with one that has picture_id as the primary key and the other as you have shown.
When you have a picture to insert, you would first try to insert it into the picture_id table using the IF NOT EXISTS clause on the insert statement. If that insert fails, then it is a duplicate and you would not insert it into the top_picture table.
In Cassandra 3.0 there is going to be support for materialized views like this, but for now you would have to manage both tables in your application code.

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition for a field that is not the primary key and is not unique.
For example if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add the 'birthday' to the primary key (lets say 'ID', 'birthday') I cannot perform this query because part of the primary key is missing.
How can i approach this by designing my column family differently ?
Thanks.
According to cassandra docs, there is no way to update rows without explicitly defining their partition key. This was done not by an accident, but because this feature (e.g. update users set status=1 where id>10) can allow user to update all data in table at once, which can be very-very-very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans within multiple partitions.
To update multiple users all at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing user's primary key, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside IN statement may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and both of them are not really acceptable:
Create an index table like
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with compound partition+clustering primary key and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to have some kind of even partition distribution in cluster. But the general idea is that you really shouldn't have a small amount of very large partitions. Prefer a large amount of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you to revisit your application logic in a way to avoid updating/deleting data at all:
instead of updating users table directly, you can have a separate table user_status you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
When you need to scan/update a lot of rows at once, prefer using tools like Spark to efficiently distribute your workload among your cluster nodes.

Resources