Cassandra (DataStax) CQL: ignore case of TEXT column - cassandra

I have created a table using the CQL below, and I want to run a query that finds all videos by actor name (case-insensitive).
CREATE TABLE video_by_actor(
actor text, added_date timestamp, video_id timeuuid,
character_name text, description text,
encoding frozen<video_encoding>,
tags set<text>, title text, user_id uuid,
primary key ((actor), added_date)) with clustering order by (added_date desc);
select * from video_by_actor where actor='Tom Hanks'
I want to select all rows from the table irrespective of the case of the actor's name, e.g. "tom hanks," "Tom hanks," "tom Hanks," etc.
Is it possible?

I want to search all case
First of all, if you want to "search," you need a different tool, like ElasticSearch. Cassandra is for key-based querying, which is very different from searching.
No, what you're looking to do really isn't possible with Cassandra, as it cares about case. I created the table definition described above and inserted four rows, each with a different casing of Tom Hanks' name. Then I queried the results with the token function:
aploetz@cqlsh:stackoverflow> SELECT actor,token(actor),title FROM video_by_actor ;
actor | system.token(actor) | title
-----------+----------------------+---------------------
Tom Hanks | -4258050846863339499 | Forrest Gump
Tom hanks | -3872727890651172910 | Saving Private Ryan
tom Hanks | -3300209463718095087 | Joe vs. the Volcano
tom hanks | 1022609553103151654 | Apollo 13
(4 rows)
Notice how each different case of "Tom Hanks" generated a different token. As this table is partitioning on actor, this means that these rows will likely be stored on different nodes.
Again, you'll probably want to use an actual search engine for something like this. They will have tools like analyzers that can have features like "fuzzy matching" enabled.
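If you must stay in Cassandra, the common workaround is to normalize case at the application layer: keep a lowercased copy of the name as the partition key (a hypothetical actor_lower column), store the display-cased name in a regular column, and lowercase the search term before querying. A minimal sketch of the normalization, under those assumptions:

```python
def normalize_actor(name: str) -> str:
    """Collapse case and extra whitespace so every spelling of a
    name maps to the same partition key."""
    return " ".join(name.split()).lower()

# Every variant below normalizes to the same key, so a query bound
# with normalize_actor(user_input) finds all of them.
variants = ["Tom Hanks", "Tom hanks", "tom Hanks", "tom hanks"]
assert {normalize_actor(v) for v in variants} == {"tom hanks"}
```

The table would then be keyed as PRIMARY KEY ((actor_lower), added_date), with the original-cased actor kept alongside for display.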

Related

Cassandra - CQL queries [COUNT, ORDER BY, GROUP BY]

I'm new to Cassandra and I'm trying to learn a bit more about how this DB engine works (especially the CQL part) and compare it with MySQL.
With this in mind I was trying some queries, but there is one particular query that I can't figure out.
From what I could read it seems that it's not possible to do this query in Cassandra, but I would like to know for sure if there is some workaround for that.
Imagine the following table [Customer] with PRIMARY_KEY = id:
id, name, city, country, email
01, Jhon, NY, USA, jhon#
02, Mary, DC, USA, mary#
03, Smith, L, UK, smith#
.....
I want to get a listing that shows me how many customers I have per country and ORDER BY DESC.
In MySQL it would be something like:
SELECT COUNT(Id), country
FROM customer
GROUP BY country
ORDER BY COUNT(Id) DESC
But in Cassandra (CQL) it seems that I can't GROUP BY columns that aren't part of the PRIMARY KEY (like "country"). Is there any way around this?
You need to define a secondary index on "country". Secondary indexes let you query a table by a column that is otherwise not queryable.
For ORDER BY, you define a clustering key (e.g. on id). Clustering keys are responsible for sorting data within a partition.
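For completeness, creating that index would look like the sketch below (the index name is illustrative). Note that it only helps with filtering: CQL's GROUP BY accepts only primary-key columns, so the index alone will not enable the MySQL-style aggregate.

```sql
-- Hypothetical index name; lets you filter rows by country...
CREATE INDEX customer_country_idx ON customer (country);
SELECT * FROM customer WHERE country = 'USA';
-- ...but GROUP BY still only accepts primary-key columns,
-- so this index alone will not enable the MySQL-style query.
```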
The main thing to remember when building a table in Cassandra, is to model its PRIMARY KEY based on how you plan to query it. In any case, defining id as the PRIMARY KEY isn't very helpful for what you're trying to do.
Also, keywords like GROUP BY and ORDER BY have special requirements. ORDER BY specifically is pretty useless (IMO), unless you plan to reverse the sort direction. But you cannot pick an arbitrary column to sort your data by.
To solve for your query above, I'll create a new table, keyed on the country, city, and id columns (in that order):
CREATE TABLE customer_by_city (
id TEXT,
name TEXT,
city TEXT,
country TEXT,
email TEXT,
PRIMARY KEY (country,city,id)
) WITH CLUSTERING ORDER BY (city ASC, id DESC);
Now, I'll INSERT the rows:
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('01', 'Jhon', 'NY', 'USA', 'jhon@gmail.com');
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('02', 'Mary', 'DC', 'USA', 'mary@gmail.com');
INSERT INTO customer_by_city (id,name,city,country,email)
VALUES ('03', 'Smith', 'London', 'UK', 'smith@gmail.com');
SELECT COUNT(Id), country FROM customer_by_city GROUP BY country ;
system.count(id) | country
------------------+---------
2 | USA
1 | UK
(2 rows)
Warnings:
Aggregation query used without partition key
Notes:
That last message means you're running a query without a WHERE clause keyed by the partition key. That means Cassandra is going to have to check every node in the cluster to serve this query. Highly inefficient.
While it works for this example, country as a partition key may not be the best way to distribute data. After all, if most of the customers are in one particular country, then they could potentially push the bounds of the maximum partition size.
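To stay efficient, restrict the aggregate to a single partition. With the customer_by_city table above, that means naming the country in the WHERE clause:

```sql
-- Counts only within the 'USA' partition; no cluster-wide scan,
-- and no "Aggregation query used without partition key" warning.
SELECT COUNT(id) FROM customer_by_city WHERE country = 'USA';
```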

Cassandra data modeling for a social network

We are using DataStax Cassandra for our social network, and while designing/data modeling the tables we need, things got confusing: we don't know how to design some of the tables and have run into a few small problems!
As we understand it, we need a separate table for every query. For example, suppose user A is following users B and C.
Now, in Cassandra we have a table that is posts_by_user:
user_id | post_id | text | created_on | deleted | view_count | likes_count | comments_count | user_full_name
We also have a table driven by users' followers: for each follower we insert the post's info into a table called user_timeline, and when followers visit the front page we read the posts from the user_timeline table.
And here is user_timeline table:
follower_id | post_id | user_id (who posted) | likes_count | comments_count | location_name | user_full_name
First, is this data modeling correct for a follow-based (follower/following) social network?
And now we want to count likes on a post. As you can see, we keep the number of likes in both tables (user_timeline and posts_by_user). Imagine a user with 1,000 followers: every like action would mean updating 1,000 rows in user_timeline plus 1 row in posts_by_user, and that is not workable!
So my second question is: how should it be? I mean, how should the like (favorite) table look?
Think of posts_by_user as holding the metadata for a post. It would house user_id, post_id, message_text, etc., while you abstract view_count, likes_count, and comments_count into a counter table. This lets you fetch either a post's metadata or its counters whenever you have the post_id, but each action only has to update one counter record.
DSE Counter Documentation:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
That said, the article below is a really good starting point for data modeling in Cassandra. There are a few things to take into consideration when answering this question, many of which will depend on the internals of your system and how your queries are structured.
The first two rules are stated as:
Rule 1: Spread Data Evenly Around the Cluster
Rule 2: Minimize the Number of Partitions Read
Take a moment to consider the user_timeline table:
user_id and created_on as a COMPOUND KEY* - This would be ideal if you wanted to query for posts by a certain user, with the assumption that you have a decent number of users. This would distribute records evenly, and your queries would only hit one partition at a time.
user_id and a hash_prefix as a COMPOUND KEY* - This would be ideal if you had a small number of users with a large number of posts; it allows your data to be spread evenly across the cluster, but you run the risk of having to query across multiple partitions.
follower_id and created_on as a COMPOUND KEY* - This would be ideal if you wanted to query for the posts a certain follower is following. The records would be distributed and queries across partitions would be minimized.
Those were 3 examples for 1 table, and the point I want to convey is: design your tables around the queries you want to execute. Also, don't be afraid to duplicate your data across multiple tables, each set up to handle a particular query; this is the way Cassandra is meant to be modeled. Take a bit to read the article below and watch the DataStax Academy Data Modeling Course to familiarize yourself with the nuances. I've also included an example schema below covering the basic counter schema I pointed out earlier.
* The reason for the compound key is due to the fact that your PRIMARY KEY has to be unique, otherwise an INSERT with an existing PRIMARY KEY will become an UPDATE.
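The footnote above is easy to demonstrate: in Cassandra every INSERT is effectively an upsert, so two INSERTs with the same primary key leave one row. A sketch using the posts_by_user schema below (the key values are hypothetical):

```sql
INSERT INTO social_media.posts_by_user (user_id, created_on, message_text)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, '2016-01-01 00:00:00+0000', 'first version');

-- Same primary key: this overwrites the row above instead of adding a second one.
INSERT INTO social_media.posts_by_user (user_id, created_on, message_text)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, '2016-01-01 00:00:00+0000', 'second version');

SELECT message_text FROM social_media.posts_by_user
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
  AND created_on = '2016-01-01 00:00:00+0000';
```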
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
https://academy.datastax.com/courses
CREATE TABLE IF NOT EXISTS social_media.posts_by_user (
user_id uuid,
post_id uuid,
message_text text,
created_on timestamp,
deleted boolean,
user_full_name text,
PRIMARY KEY ((user_id), created_on)
);
CREATE TABLE IF NOT EXISTS social_media.user_timeline (
follower_id uuid,
post_id uuid,
user_id uuid,
location_name text,
user_full_name text,
created_on timestamp,
PRIMARY KEY ((user_id), created_on)
);
CREATE TABLE IF NOT EXISTS social_media.post_counts (
likes_count counter,
view_count counter,
comments_count counter,
post_id uuid,
PRIMARY KEY (post_id)
);
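Counter columns can only be modified with UPDATE (never INSERTed or set to an absolute value), which is what makes the single-write like possible. A sketch with a hypothetical post_id:

```sql
-- One write per like, regardless of how many followers the poster has.
UPDATE social_media.post_counts
SET likes_count = likes_count + 1
WHERE post_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```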

Design timeseries schema for cassandra

My application wants to display photos uploaded on a given day, in descending order.
I looked at the weather station example for Cassandra, where I get time-series data for a particular station. In my case I want to track all photos present in the system. I have designed the schema like below:
create table if not exists photos(
photo_id uuid,
category text,
owner uuid,
date text,
created timestamp,
primary key((date),created)
)WITH CLUSTERING ORDER BY (created DESC);
Here date is the MM/DD/YYYY string representation of the created date.
The problem here is that when I do a select I want the latest photos based on created date, but I get rows back in random order (well, they are ordered in desc order if they have the same date). I want to fetch rows for the latest date when I do a select.
The problem here is that when I do a select I want the latest photos based on created date, but I get rows back in random order
Actually, they are in order: ordered by the hashed value of your partition key (date). Cassandra can only maintain clustering order within a partition. This is why rows are sorted by created only when they share the same date.
I want to fetch rows for the latest date when I do a select.
You can do that. All you need to do is specify a date when you do it.
SELECT * FROM photos WHERE date='03/28/2015';
By restricting your partition key, your rows will be returned in their defined clustering order. From your application or reporting level, generating the current date shouldn't be too hard to do.
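At the application level, producing the partition key for "today" is a one-liner. A sketch in Python (the MM/DD/YYYY format matches the table above; in practice you'd bind the value through a prepared statement rather than interpolate it into the string):

```python
from datetime import datetime, timezone

def latest_photos_query(now=None):
    """Build the MM/DD/YYYY partition key and the matching SELECT."""
    now = now or datetime.now(timezone.utc)
    date_key = now.strftime("%m/%d/%Y")
    return f"SELECT * FROM photos WHERE date='{date_key}';"
```

For example, latest_photos_query(datetime(2015, 3, 28)) produces the same query shown in the answer above.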
Also, not to self-promote, but earlier this month Planet Cassandra posted an article that I wrote on this subject (largely based on questions I have answered on this site): We Shall Have Order! Give that a read and it should help you with these types of problems.
Try ORDER BY in the select operation; it will return the data ordered by date.
The example below shows the photos ordered by created date, ascending.
cqlsh:temp> SELECT * FROM photos WHERE created in (1427524795784,1427524795899) and date = 'march-28' ORDER BY created ASC ;
date | created | category | owner | photo_id
----------+--------------------------+----------+--------------------------------------+--------------------------------------
march-28 | 2015-03-28 10:39:55+0400 | nature | 007aa9b5-c86b-4805-a65d-6019d1ba820b | 007aa9b5-c86b-4805-a65d-6019d1ba820b
march-28 | 2015-03-28 10:39:55+0400 | nature | 007aa9b5-c86b-4805-a65d-6019d1ba820b | 007aa9b5-c86b-4805-a65d-6019d1ba820b

Get Date Range for Cassandra - Select timeuuid with IN returning 0 rows

I'm trying to get data from a date range on Cassandra, the table is like this:
CREATE TABLE test6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((time), id)
)
But when I select a date range I get nothing:
SELECT * FROM teste WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?
The IN condition is used to specify multiple keys for a SELECT query, not a range. To run a date range query on your table you're close, but you'll want to use greater-than and less-than instead.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This also means that you'll need to specify your id in the WHERE clause, as well:
CREATE TABLE teste6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((id), time)
)
INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard@lapd.gov','Rick Deckard');
SELECT * FROM teste6
WHERE id='B26354'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= now();
id | time | checked | email | name
--------+--------------------------------------+---------+-------------------+--------------
B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d | True | rdeckard@lapd.gov | Rick Deckard
(1 rows)
Now while this will technically work, partitioning your data by id might not work for your application. So you may need to put some more thought behind your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
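For completeness, here is a sketch of what minTimeuuid computes under the hood: a version-1 UUID whose timestamp is 100-nanosecond ticks since 1582-10-15, with clock-seq and node pinned to their minimum values. This is an illustrative reimplementation; the DataStax Python driver ships similar helpers (e.g. cassandra.util.min_uuid_from_time):

```python
from datetime import datetime, timezone
from uuid import UUID

# 100-ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000

def min_timeuuid(dt: datetime) -> UUID:
    """Smallest version-1 UUID for a given instant, like CQL's minTimeuuid()."""
    ticks = int(dt.timestamp() * 10_000_000) + GREGORIAN_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi = ((ticks >> 48) & 0x0FFF) | 0x1000  # set version 1
    # clock_seq and node at their minimum values for the RFC 4122 variant
    return UUID(fields=(time_low, time_mid, time_hi, 0x80, 0x00, 0))

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
assert min_timeuuid(epoch).time == GREGORIAN_OFFSET
```

Two instants compare the same way their minimal timeuuids do, which is why the greater-than/less-than bounds above behave like a date range.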

Does CQL3 require a schema for Cassandra now?

I've just had a crash course in Cassandra over the last week: I went from the Thrift API to CQL, to grokking SuperColumns, to learning I shouldn't use them and should use Composite Keys instead.
I'm now trying out CQL3, and it would appear that I can no longer insert into columns that are not defined in the schema, or see those columns in a select *.
Am I missing some option to enable this in CQL3, or does it expect me to define every column in the schema (defeating the purpose of wide, flexible rows, imho)?
Yes, CQL3 does require columns to be declared before used.
But, you can do as many ALTERs as you want, no locking or performance hit is entailed.
That said, most of the places that you'd use "dynamic columns" in earlier C* versions are better served by a Map in C* 1.2.
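For reference, the map-based alternative mentioned above looks roughly like this (table and column names are illustrative):

```sql
CREATE TABLE users (
    id uuid PRIMARY KEY,
    attrs map<text, text>   -- the "dynamic" key/value pairs live here
);

-- Add or overwrite individual entries without ALTERing the schema:
UPDATE users SET attrs['location'] = 'NYC'
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
```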
I suggest you explore composite columns with "WITH COMPACT STORAGE".
A "COMPACT STORAGE" column family allows you, in practice, to define only the key columns:
Example:
CREATE TABLE entities_cargo (
entity_id ascii,
item_id ascii,
qt ascii,
PRIMARY KEY (entity_id, item_id)
) WITH COMPACT STORAGE
Actually, when you insert different values for item_id, you don't add a row with entity_id, item_id, and qt: you add a column whose name is the item_id content and whose value is the qt content.
So:
insert into entities_cargo (entity_id,item_id,qt) values('100','oggetto 1','3');
insert into entities_cargo (entity_id,item_id,qt) values('100','oggetto 2','3');
Now, here is how you see this rows in CQL3:
cqlsh:goh_master> select * from entities_cargo where entity_id = '100';
entity_id | item_id | qt
-----------+-----------+----
100 | oggetto 1 | 3
100 | oggetto 2 | 3
And here is how they look if you check them from the CLI:
[default@goh_master] get entities_cargo[100];
=> (column=oggetto 1, value=3, timestamp=1349853780838000)
=> (column=oggetto 2, value=3, timestamp=1349853784172000)
Returned 2 results.
You can access a single column with
select * from entities_cargo where entity_id = '100' and item_id = 'oggetto 1';
Hope it helps
Cassandra still allows using wide rows. This answer references a DataStax blog entry (written after the question was asked) that details the links between CQL and the underlying architecture.
Legacy support
A dynamic column family defined through Thrift with the following command (notice there is no column-specific metadata):
create column family clicks
with key_validation_class = UTF8Type
and comparator = DateType
and default_validation_class = UTF8Type
Here is the exact equivalent in CQL:
CREATE TABLE clicks (
key text,
column1 timestamp,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
Both of these commands create a wide-row column family that stores records ordered by date.
CQL Extras
In addition, CQL provides the ability to assign labels to the row id, column, and value elements to indicate what is being stored. The following alternative way of defining this same structure in CQL highlights this feature using DataStax's example: a column family for storing users' clicks on a website, ordered by time:
CREATE TABLE clicks (
user_id text,
time timestamp,
url text,
PRIMARY KEY (user_id, time)
) WITH COMPACT STORAGE
Notes
a Table in CQL is always mapped to a Column Family in Thrift
the CQL driver uses the first element of the primary key definition as the row key
Composite Columns are used to implement the extra columns that one can define in CQL
using WITH COMPACT STORAGE is not recommended for new designs because it fixes the number of possible columns. In other words, ALTER TABLE ... ADD is not possible on such a table. Just leave it out unless it's absolutely necessary.
Interesting, something I didn't know about CQL3. In PlayOrm, the idea is that you define a "partial" schema, and in the WHERE clause of the select you can only use what is defined in the partial schema, BUT it returns ALL the data of the rows, EVEN the data it does not know about... I would expect CQL to do the same :( I need to look into this now.
Thanks,
Dean
