I'm working with the movielens dataset and I have a column called 'genres' which has entries such as 'Action|War', 'Action|Adventure|Comedy|Sci-Fi'. I wish to count the number of rows that have the text 'Comedy' in them.
SELECT COUNT(*) FROM movielens.data_movies WHERE genres = 'Comedy' ALLOW FILTERING
But this counts only the exact instances of 'Comedy'. It does not count 'Action|Adventure|Comedy|Sci-Fi' which I want it to do. So I tried,
SELECT COUNT(*) FROM movielens.data_movies WHERE genres CONTAINS 'Comedy' ALLOW FILTERING
However, that gives me the error
Cannot use CONTAINS on non-collection column genres
From this it seems that there is no easy way to do what I'm asking. Does anyone know of a simpler solution?
What you can do is create a custom (SASI) index on genres.
CREATE CUSTOM INDEX ON movielens.data_movies(genres)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
Then this query should work:
SELECT COUNT(*) FROM movielens.data_movies
WHERE genres LIKE '%Comedy%';
However, if you're running a query across millions of rows over multiple nodes, this query will likely time out. This is because Cassandra has to poll multiple partitions and nodes to build the result set. Queries like this don't really work well in Cassandra.
The best way to solve this is to create a table partitioned by genre, like this:
CREATE TABLE movies_by_genre (
id int,
title TEXT,
genre TEXT,
PRIMARY KEY(genre,title,id));
Of course, this assumes that the genres value is split out so that each individual genre gets its own row. But then this query would work:
SELECT COUNT(*) FROM movies_by_genre
WHERE genre = 'Comedy';
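The "split out by each individual genre" step has to happen in the application before writing to movies_by_genre. A minimal sketch of that step in plain Python, assuming the pipe-delimited format shown in the question; the function name explode_genres and the sample movie are hypothetical, and no Cassandra driver calls are shown:

```python
def explode_genres(movie_id, title, genres):
    """Return one (genre, title, id) tuple per genre in a pipe-delimited string.

    Each tuple corresponds to one row to insert into movies_by_genre.
    """
    return [(genre, title, movie_id) for genre in genres.split("|") if genre]


# Hypothetical sample movie, using a genres value from the question.
rows = explode_genres(1, "Example Movie", "Action|Adventure|Comedy|Sci-Fi")
for row in rows:
    print(row)
```

Each resulting tuple would be written as its own row, so a movie with four genres ends up in four partitions.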
Related
I have a question about querying a Cassandra collection.
I want to write a query that searches within a collection column.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
That is the table structure.
[Image: table contents]
In this situation, I want to write a CQL query that matches any of several alternative values in the set column.
If it were SQL and testcol3 weren't a collection, I would write:
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
But it is CQL and a collection, so I tried:
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
But neither of these two queries worked.
Please help.
This won't work for you for multiple reasons:
There is no OR operation in CQL.
You can only do a full match on the value of the partition key (testcol3).
Although you may create secondary indexes on fields with a collection type, it's impossible to create an index on the values of a partition key.
You need to change the data model, but to do that you need to know in advance which queries you will be executing. From a brief look at your data model, I would suggest rolling out the set field into multiple rows, with each individual value corresponding to its own partition.
I would also suggest taking the DS201 and DS220 courses on the DataStax Academy site to better understand how Cassandra works and how to model data for it.
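To illustrate the suggested remodeling, here is a rough in-memory sketch in Python: each set is rolled out into one entry per value, and the "4 or 5" query becomes one lookup per value with the results merged on the client. The names rollout and lookup_any and the sample rows are hypothetical; the dict only stands in for a table partitioned by the individual value.

```python
from collections import defaultdict


def rollout(rows):
    """Map each individual set value to the rows whose set contains it."""
    index = defaultdict(list)
    for values, col1, col2 in rows:
        for value in values:
            index[value].append((frozenset(values), col1, col2))
    return index


def lookup_any(index, wanted):
    """Emulate 'value = a OR value = b' with one lookup per value, deduplicated."""
    seen = set()
    result = []
    for value in wanted:
        for row in index.get(value, []):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


# Hypothetical sample data shaped like (testcol3, testcol1, testcol2).
sample = [({"1", "4"}, "a", 1), ({"4", "5"}, "b", 2), ({"7"}, "c", 3)]
matches = lookup_any(rollout(sample), ["4", "5"])
print(matches)
```

In a real cluster, each lookup would be a separate query against a single partition, which is the access pattern Cassandra handles well.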
I have a very simple data table. But after reading a lot of examples in the internet, I am still more and more confused how to solve the following scenario:
1) The Table
My data table looks like this (without defining the primary key, as that is the part I don't understand):
CREATE TABLE documents (
uid text,
created text,
data text
)
Now my goal is to have two different ways to select data.
2) Select by the UID:
SELECT * FROM documents
WHERE uid = 'xxxx-yyyyy-zzzz'
3) Select by a date limit
SELECT * FROM documents
WHERE created >= '2015-06-05'
So my question is:
What should my table definition in Cassandra look like, so that I can perform these selections?
To achieve both queries, you would need two tables.
First one would look like:
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid));
and you retrieve your data with: SELECT * FROM documents WHERE uid='xxxx-yyyy-zzzzz'; Of course, uid must be unique. You might want to consider the uuid data type (instead of text).
The second one is more delicate. If you set your partition key to the full date, you won't be able to do a range query, as range queries are only available on clustering columns. So you need to find the sweet spot for your partition key in order to:
make sure a single partition won't be too large (max 100MB, otherwise you will run into trouble)
satisfy your query requirements.
As an example:
CREATE TABLE documents_by_date (
year int,
month int,
day int,
uid text,
data text,
PRIMARY KEY ((year, month), day, uid));
This works fine if you don't have too many documents within a month (so your partitions don't grow too much). It also allows you to write queries such as: SELECT * FROM documents_by_date WHERE year=2018 and month=12 and day>=6 and day<=24; If you need a range query that spans multiple months, you will need to issue multiple queries.
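Splitting a date range into one query per (year, month) partition can be done on the client. The following is a minimal Python sketch, assuming the documents_by_date layout above; the helper name month_slices is hypothetical, and 31 is used as a safe upper bound for the day column in months that are fully covered.

```python
from datetime import date


def month_slices(start, end):
    """Yield (year, month, first_day, last_day) for each month touched by [start, end]."""
    if start > end:
        return
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Only the first and last months are partially covered.
        first_day = start.day if (year, month) == (start.year, start.month) else 1
        last_day = end.day if (year, month) == (end.year, end.month) else 31
        yield (year, month, first_day, last_day)
        month += 1
        if month > 12:
            year, month = year + 1, 1


for y, m, lo, hi in month_slices(date(2018, 11, 20), date(2019, 1, 5)):
    # One query per partition; the values would be bound as query parameters.
    print(f"SELECT * FROM documents_by_date "
          f"WHERE year={y} AND month={m} AND day>={lo} AND day<={hi};")
```

The per-month queries are independent, so an application could also run them concurrently and concatenate the results.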
If your partition is too large because of the data field, you will need to remove that field from documents_by_date and use the documents table to retrieve the data, given the uid you retrieved from documents_by_date.
If your partition is still too large, you will need to add hour in the partition key of documents_by_date.
So overall, it's not a straightforward request, and you will need to find the right balance for yourself when defining your partition key.
If latency is not a huge concern, an alternative would be to use the Stratio Lucene index plugin for Cassandra and index your date.
The question does not specify how your data is distributed with respect to user and creation time. But since it's a document, I am assuming that one user creates one document at one "created" time.
Below is the table definition you can use.
CREATE TABLE documents (
uid text,
created text,
data text,
PRIMARY KEY (uid, created)
) WITH CLUSTERING ORDER BY (created DESC);
WITH CLUSTERING ORDER BY (created DESC) lets you get the data ordered by created for a given user.
For your first requirement, you can query as shown below.
SELECT * FROM documents WHERE uid = 'SEARCH_UID';
For your second requirement, you can query as shown below.
SELECT * FROM documents WHERE created > '2018-04-10 11:32:00' ALLOW FILTERING;
ALLOW FILTERING should be used sparingly, as it scans all partitions. If we create a separate table with the date as the primary key, it becomes tricky when many documents are inserted in the very same second. Clustering order works best for requirements where the documents of a given user need to be sorted by time.
I have a table like:
CREATE TABLE videos_by_tags (
tag text,
video_id uuid,
...
PRIMARY KEY ((tag), video_id)
)
which will support queries such as "get videos by tagA". If in the future I need to support queries like "get videos by tagA and tagB and tagC", how would I set up my Cassandra tables?
Is there a specific way to set up tables for this scenario, or do I have to query in a specific way?
Or would I have to resort to grabbing an x of amount of results for each tag criteria (ie: 100 items from tagA, 100 items from tagB, etc) and manually determine the same, unique items from each set?
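The client-side fallback described in the last paragraph amounts to a set intersection. A minimal Python sketch, where the dict videos_by_tag merely stands in for the result of one query per tag against videos_by_tags, and the function name and sample ids are hypothetical:

```python
def videos_with_all_tags(videos_by_tag, tags):
    """Return the ids that appear in the result set of every requested tag."""
    id_sets = [set(videos_by_tag.get(tag, ())) for tag in tags]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Hypothetical per-tag query results (video ids shortened to strings).
fetched = {
    "tagA": ["v1", "v2", "v3"],
    "tagB": ["v2", "v3", "v4"],
    "tagC": ["v3", "v2", "v9"],
}
common = videos_with_all_tags(fetched, ["tagA", "tagB", "tagC"])
print(sorted(common))
```

Note that if each per-tag query is capped (for example at 100 items), videos that carry all the tags but fall outside a cap are missed, so this approach is only exact when every tag's full result set is fetched.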
I am new to Cassandra and am coming from Postgres. I was wondering if there is a way to get data from 2 different tables or column families and return the combined results. I have this query:
select p.fullname, p.picture, s.post, s.id, s.comments, s.state, s.city FROM profiles as p INNER JOIN Chats as s ON(p.id==s.profile_id) WHERE s.latitudes>=28 AND 29>= s.latitudes AND s.longitudes>=-21 AND -23>= s.longitudes
The query has 2 tables, Profiles and Chats, which share a common field (Chats.id==Profiles.profile_id). It basically boils down to returning all rows where the Chat ID is equal to the Profiles ID. I would like to keep it that way, because updating a profile is then simple: I only need to update 1 row per profile update, instead of denormalizing everything and updating thousands of records. Any help or suggestions would be great.
You have to design your tables in such a way that you won't need joins. The best practice is for each table to match exactly the use case it serves.
Cassandra has a feature called static columns: a static column's value is shared by all rows of a partition, which lets you bind values to the partition key part of the primary key. Thus, you can create a "joined" version of the table without duplicating the profile data.
CREATE TABLE t (
p_id uuid,
p_fullname text STATIC,
p_picture text STATIC,
s_id uuid,
s_post text,
s_comments text,
s_state text,
s_city text,
PRIMARY KEY (p_id, s_id)
);
I have a requirement where I need to find the top-ranked pictures from a certain city in chronological order. I came up with the schema below:
create table top_picture(
picture_id uuid,
city text,
rank int,
date timestamp,
primary key (city,date,rank)
) with CLUSTERING ORDER BY (date desc,rank desc);
It solves the problem to some extent (apart from duplicates) with the following query:
select * from top_picture where city='san diego';
But if the same picture_id is inserted again on the same day, I get duplicate entries, because picture_id is not part of the partition key. However, I cannot add it to the partition key, because then I would have to provide picture_id in the query, and the simple selection above would no longer return the top pictures for a city.
Has anyone come across this type of schema before, or are there other recommended ways to do it?
It sounds like you want two views of the data. In one view you want to get the top ranked pictures and in the other view you want the picture_id to be unique.
So you could have two tables, with one that has picture_id as the primary key and the other as you have shown.
When you have a picture to insert, you would first try to insert it into the picture_id table using the IF NOT EXISTS clause on the insert statement. If that insert fails, then it is a duplicate and you would not insert it into the top_picture table.
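The control flow of that two-table insert can be sketched in plain Python. This is only an in-memory model under stated assumptions: the dict pictures_by_id stands in for the table keyed by picture_id, the membership check stands in for an INSERT ... IF NOT EXISTS that was not applied, and the function name insert_picture is hypothetical.

```python
def insert_picture(pictures_by_id, top_picture, picture_id, city, rank, date):
    """Insert into both 'tables' only if picture_id has not been seen before.

    Returns True if the picture was inserted, False if it was a duplicate.
    """
    if picture_id in pictures_by_id:
        # Models a conditional insert that was not applied: skip top_picture.
        return False
    pictures_by_id[picture_id] = (city, rank, date)
    top_picture.setdefault(city, []).append((date, rank, picture_id))
    return True


pictures_by_id, top_picture = {}, {}
first = insert_picture(pictures_by_id, top_picture, "pic-1", "san diego", 5, "2015-06-05")
again = insert_picture(pictures_by_id, top_picture, "pic-1", "san diego", 7, "2015-06-05")
print(first, again, top_picture["san diego"])
```

In this model, the second insert of the same picture_id leaves top_picture unchanged, which is the deduplication effect the answer describes.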
In Cassandra 3.0 there is going to be support for materialized views like this, but for now you would have to manage both tables in your application code.