I am exploring Apache Ignite on top of Cassandra as a possible tool for running ad-hoc queries on Cassandra tables. Using Ignite, is it possible to search or query on any column in the underlying Cassandra tables, as in an RDBMS? Or can the join and search columns only be partition and clustering columns?
If using Ignite, is there still a need to create indexes on Cassandra? Also, how does Ignite treat materialized views? Will there be a need to create materialized views?
Any insights into how Cassandra release updates can/will be handled by Ignite would also be very helpful.
Let me elaborate on my question further:
Customer table:
CREATE TABLE customer (
    customer_id int,
    joined_date date,
    name text,
    address text,
    is_active boolean,
    created_by text,
    updated_by text,
    last_updated timestamp,
    PRIMARY KEY (customer_id, joined_date)
);
Product table:
CREATE TABLE PDT_BY_ID (
    device_id uuid,
    "desc" text,    -- desc is a reserved word in CQL and must be quoted
    serial_number text,
    common_name text,
    customer_id int,
    manu_name text,
    last_updated timestamp,
    model_number text,
    price double,
    PRIMARY KEY ((device_id), serial_number)
) WITH CLUSTERING ORDER BY (serial_number ASC);
A join on these tables is possible using Apache Ignite. But is a join possible on non-primary-key columns?
Is it possible, for example, to run queries on the product table like WHERE customer_id = ... AND model_number LIKE '%ABC%'?
Is it possible to write RDBMS-like queries with conditions on any column, and to run ad-hoc queries on the tables?
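For context, these are exactly the filters that plain CQL rejects, which is part of the motivation for layering Ignite on top. A sketch against the product table above (the customer_id value is a placeholder):

-- Plain CQL: filtering on a non-key, non-indexed column is rejected...
SELECT * FROM PDT_BY_ID WHERE customer_id = 42;
-- ...unless you accept a full scan with ALLOW FILTERING:
SELECT * FROM PDT_BY_ID WHERE customer_id = 42 ALLOW FILTERING;
-- and LIKE '%ABC%' is not supported at all without a SASI index.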
This is discussed on the Apache Ignite user forum: http://apache-ignite-users.70518.x6.nabble.com/Newbie-Questions-on-Ignite-over-cassandra-td10264.html
Is it possible to limit text size in cassandra like you would do in sql when creating a table?
username character varying(20)
My CQL query:
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    username text,
    date_created bigint,
    profile_pic text,
    num_followers int,    -- note: CQL has no "integer" type; it's "int"
    name text
);
No, Cassandra does not allow you to limit the size of a VARCHAR/TEXT when creating a table.
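In CQL, varchar is just an alias for text; neither takes a length, so any limit has to be enforced client-side before the INSERT. A quick sketch (hypothetical table name):

-- both spellings declare the same unbounded UTF-8 string type:
CREATE TABLE users_v (user_id uuid PRIMARY KEY, username varchar);
-- the INSERT succeeds no matter how long username is:
INSERT INTO users_v (user_id, username) VALUES (uuid(), 'any-length-string-is-accepted');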
user_id uuid PRIMARY KEY,
Out of curiosity, why is user_id (a UUID) the sole PRIMARY KEY? Do you need to support a lot of queries by user_id?
If not, then you should consider partitioning on something that provides a little more query flexibility, and perhaps use user_id as a clustering key (to ensure uniqueness).
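For example (a hypothetical sketch, not the poster's schema): if users are usually looked up by username, you could partition on username and keep user_id as a clustering column:

CREATE TABLE users_by_username (
    username text,
    user_id uuid,    -- clustering column: guarantees uniqueness within the partition
    date_created bigint,
    profile_pic text,
    num_followers int,
    name text,
    PRIMARY KEY (username, user_id)
);

Queries by username then hit a single partition; lookups by user_id alone would still need the original table.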
I have two tables, as below:
CREATE TABLE model_vals (
    model_id int,
    data_item_code text,
    date date,
    data_item text,
    pre_cal1 text,
    pre_cal2 text,
    pre_cal3 text,
    pre_cal4 text,
    pre_cal5 text,
    pre_cal6 text,
    PRIMARY KEY ((model_id, data_item), date)
) WITH CLUSTERING ORDER BY (date DESC);
CREATE TABLE prapre_calulated_vals (
    id int,
    precal_code text,
    date date,
    precal_item text,
    pre_cal1 text,
    pre_cal2 text,
    pre_cal3 text,
    pre_cal4 text,
    pre_cal5 text,
    pre_cal6 text,
    PRIMARY KEY ((id, precal_item), date)
) WITH CLUSTERING ORDER BY (date DESC);
After processing input data from Kafka using spark-sql, the result data is inserted into the first (model_vals) C* table, which in turn serves some web-service endpoints.
Another piece of business logic needs the data from the first (model_vals) C* table, processes it, and populates the results into the second (prapre_calulated_vals) C* table.
For the web-service endpoint, the end user can pass the required WHERE condition and get the data from the first (model_vals) C* table.
But for the further processing, I need to read the entire first (model_vals) C* table, process the data, do another set of calculations, and populate the second (prapre_calulated_vals) C* table.
The first (model_vals) C* table has millions of records, so we can't load the entire table at once to process it.
How should this scenario be handled in C*? What alternatives do I have?
You have several options, depending on the complexity of what you need done. In general, it sounds like you need some sort of streaming framework that, while writing new data to your records, also applies some business logic and writes to a second table.
Some technologies that come to mind are:
Spark Streaming
Flink
Apex
All of these technologies have Cassandra connectors that can efficiently read both entire tables and portions of tables for joining with new data. Of course, this will be slower than aggregating over flat files or making small requests for tiny amounts of data.
If you don't need a streaming approach, since you are already using Spark, I would suggest using a subsequent SparkSQL query to populate your final table.
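Under the hood, those connectors avoid loading the whole table at once by splitting the read across token ranges, and the same idea can be sketched directly in CQL (the boundary values below are placeholders, assuming the default Murmur3 partitioner):

-- read one slice of the token ring at a time:
SELECT * FROM model_vals
WHERE token(model_id, data_item) > -9223372036854775808
  AND token(model_id, data_item) <= -4611686018427387904;
-- repeat with successive token ranges until the whole ring is covered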
I created a database "Movies" with three column families:
CREATE TABLE movies (
    movie_id int PRIMARY KEY,
    title text,
    avg_rating decimal,
    total_ratings int,
    genres set<text>
);
-- shows all ratings for a specific movie
CREATE TABLE ratings_by_movie (
    movie_id int,
    user_id int,
    rating decimal,
    ts int,
    PRIMARY KEY (movie_id, user_id)
);
-- shows all ratings by a specific user
CREATE TABLE ratings_by_user (
    user_id int,
    movie_id int,
    rating decimal,
    ts int,
    PRIMARY KEY (user_id, movie_id)
);
Is it possible to make the following queries?
Show the movie with the most reviews
Show all movies with the average rating >= 4
Show 100 best movies based on their ratings
Cassandra = no joins. Your model is 100% relational. You need to rethink it for Cassandra. I would advise you to take a look at these slides; they dig deep into how to model data for Cassandra. There is also a webinar covering the topic. But stop thinking in terms of foreign keys and joining tables, because if you need relations, Cassandra isn't the tool for the job.
But Why?
Because then you would need to check consistency and do many other things that relational databases do, and so you would lose the performance and scalability that Cassandra offers.
What can I do?
DENORMALIZE! Lots of data in one table? But the table will have too many columns!
So? Cassandra can handle a very large number of columns in a table.
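For example (a hypothetical sketch, not from the original post): to serve "show the 100 best movies", you could maintain a denormalized lookup table clustered by rating, updated whenever a movie's average changes:

CREATE TABLE movies_by_rating (
    bucket int,    -- single artificial bucket (e.g. always 0); fine while the movie count stays modest
    avg_rating decimal,
    movie_id int,
    title text,
    PRIMARY KEY (bucket, avg_rating, movie_id)
) WITH CLUSTERING ORDER BY (avg_rating DESC, movie_id ASC);

-- 100 best movies by rating:
SELECT title, avg_rating FROM movies_by_rating WHERE bucket = 0 LIMIT 100;
-- all movies with average rating >= 4:
SELECT title, avg_rating FROM movies_by_rating WHERE bucket = 0 AND avg_rating >= 4;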
For more details check: How to do a join queries with 2 or more tables in cassandra cql
I tried reading up on DataStax blogs and documentation but could not find anything specific on this.
Is there a way to make two tables in Cassandra belong to the same partition?
For example:
CREATE TYPE addr (
    street_address1 text,
    city text,
    state text,
    country text,
    zip_code text
);
CREATE TABLE foo (
    account_id timeuuid,
    data text,
    site_id int,
    PRIMARY KEY (account_id)
);
CREATE TABLE bar (
    account_id timeuuid,
    address_id int,
    address frozen<addr>,
    PRIMARY KEY (account_id, address_id)
);
Here I need to ensure that both of these tables/CFs live on the same partition, so that for the same account_id both sets of data can be fetched from the same node.
Any pointers are highly appreciated.
Also, if someone has experience using UDTs (User Defined Types), I would like to understand how backward compatibility works. If I modify the "addr" UDT to add a couple more attributes (say, zip_code2 int and name text), how do the older rows that do not have these attributes behave? Is it even compatible?
Thanks
If two tables share the same replication strategy and the same partition key, their partitions will be colocated. So as long as the two tables are in the same keyspace AND their partition keys match
PRIMARY KEY (account_id) == PRIMARY KEY (account_id, address_id)
(in both cases the partition key is just account_id; address_id is only a clustering column), any given account_id will be on (and replicated to) the same machines.
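A quick way to convince yourself (a sketch; the timeuuid value is a placeholder): the token, and therefore the replica set, depends only on the partition key, so both tables return the same value for the same account_id:

-- both tables hash only account_id, so the tokens (and replicas) match
SELECT token(account_id) FROM foo WHERE account_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;
SELECT token(account_id) FROM bar WHERE account_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;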
I am using Cassandra for the first time in a web app, and I have a query problem.
Here is my table:
CREATE TABLE vote (
    doodle_id uuid,
    user_id uuid,
    schedule_id uuid,
    vote int,
    PRIMARY KEY ((doodle_id), user_id, schedule_id)
);
On every request, I specify my partition key, doodle_id.
For example, I can run the following without any problem:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and user_id = 97a7378a-e1bb-4586-ada1-177016405142;
But on the last request I made:
select * from vote where doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7 and schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
I got the following error:
Bad Request: PRIMARY KEY column "schedule_id" cannot be restricted (preceding column "user_id" is either not restricted or by a non-EQ relation)
I'm new to Cassandra, but correct me if I'm wrong: in a composite primary key, the first part is the PARTITION KEY, which is mandatory for Cassandra to know where to look for the data.
The other parts are the CLUSTERING KEY, used to sort data within the partition.
But I still don't get why my first request works and the second one doesn't.
If anyone could help, it would be a great pleasure.
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your second query (queries by doodle_id and schedule_id, but not necessarily with user_id) is to create a new table to handle that specific query. This table will be pretty much the same, except the PRIMARY KEY will be slightly different:
CREATE TABLE votebydoodleandschedule (
    doodle_id uuid,
    user_id uuid,
    schedule_id uuid,
    vote int,
    PRIMARY KEY ((doodle_id), schedule_id, user_id)
);
Now this query will work:
SELECT * FROM votebydoodleandschedule
WHERE doodle_id = c4778a27-f2ca-4c96-8669-15dcbd5d34a7
AND schedule_id = c37df0ad-f61d-463e-bdcc-a97586bea633;
This gets you around having to specify ALLOW FILTERING. Relying on ALLOW FILTERING is never a good idea, and is certainly not something that you should do in a production cluster.
The clustering key is also used to locate columns within a given partition. With your model, you'll be able to query by the following combinations (examples after the list):
doodle_id
doodle_id/user_id
doodle_id/user_id/schedule_id
user_id using ALLOW FILTERING
user_id/schedule_id using ALLOW FILTERING
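Concretely, the supported shapes look like this (a sketch; the ? marks are placeholder values):

SELECT * FROM vote WHERE doodle_id = ?;                                      -- partition only
SELECT * FROM vote WHERE doodle_id = ? AND user_id = ?;                      -- plus first clustering column
SELECT * FROM vote WHERE doodle_id = ? AND user_id = ? AND schedule_id = ?;  -- full key
SELECT * FROM vote WHERE user_id = ? ALLOW FILTERING;                        -- cluster-wide scan: avoid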
You can see your primary key as a file path, doodle_id#123/user_id#456/schedule_id#789, where all data is stored in the deepest folder (i.e. schedule_id#789). When querying, you have to indicate the subfolder/subtree from which to start searching.
Your second query doesn't work because of how columns are organized within the partition: Cassandra cannot read a contiguous slice of columns, because they are interleaved.
You should invert the primary key order to (doodle_id, schedule_id, user_id) to be able to run your query.