How to index a high-cardinality column in Cassandra

I have a high-cardinality column that I need to index because I have to run range queries on it. I know that secondary indexes are a poor fit for high-cardinality columns in Cassandra, so I tried creating a materialized view on the table with that column as the partition key, but range queries on the view do not work without ALLOW FILTERING. Always running ALLOW FILTERING queries is not good practice on large amounts of data. What schema should I use?
CREATE TABLE testing_status.input_log_profile_1 (
cid text,
ctdon bigint,
ctdat bigint,
email text,
addrs set<frozen<udt_addrs>>,
asset set<frozen<udt_asset>>,
cntno set<frozen<udt_cntno>>,
dob frozen<udt_date>,
dvc set<frozen<udt_dvc>>,
eaka set<text>,
edmn text,
educ set<frozen<udt_educ>>,
error_status text,
gen tinyint,
hobby set<text>,
income set<frozen<udt_income>>,
interest set<text>,
lang set<frozen<udt_lang>>,
levnt set<frozen<udt_levnt>>,
like map<text, frozen<set<text>>>,
loc set<frozen<udt_loc>>,
mapp set<text>,
name frozen<udt_name>,
params map<text, frozen<set<text>>>,
prfsn set<frozen<udt_prfsn>>,
rel set<frozen<udt_rel>>,
rel_s tinyint,
skills_prfsn set<frozen<udt_skill_prfsn>>,
snw set<frozen<udt_snw>>,
sport set<text>,
status_id tinyint,
PRIMARY KEY (cid, ctdon, ctdat, email)
) WITH CLUSTERING ORDER BY (ctdon ASC, ctdat ASC, email ASC);
CREATE CUSTOM INDEX status_idx ON testing_status.input_log_profile_1 (status_id) USING 'org.apache.cassandra.index.sasi.SASIIndex';
CREATE CUSTOM INDEX err_idx ON testing_status.input_log_profile_1 (error_status) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'false'};
Here status_id contains a sequence of IDs, which makes it a high-cardinality field.
My query looks like this:
select * from input_log_profile_1 where cid='1_1' and status_id >= 1 and status_id <= 100 ;
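One possible direction, offered only as a sketch: since the query always restricts cid with equality, status_id could be made the first clustering column of a separate query table, so that the range is served from within a single partition. The table name input_log_profile_by_status is hypothetical, and only a few of the original columns are shown.

```cql
-- Hypothetical query table (name and reduced column list are assumptions).
-- status_id is the first clustering column, so rows inside each cid
-- partition are stored sorted by status_id.
CREATE TABLE testing_status.input_log_profile_by_status (
    cid text,
    status_id tinyint,
    ctdon bigint,
    ctdat bigint,
    email text,
    error_status text,
    PRIMARY KEY ((cid), status_id, ctdon, ctdat, email)
) WITH CLUSTERING ORDER BY (status_id ASC, ctdon ASC, ctdat ASC, email ASC);

-- Partition key restricted with "=", first clustering column with a range.
SELECT * FROM testing_status.input_log_profile_by_status
WHERE cid = '1_1' AND status_id >= 1 AND status_id <= 100;
```

A trade-off to keep in mind: once status_id is part of the primary key it can no longer be updated in place, so changing a row's status means deleting the old row and inserting a new one.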

Related

Not able to run multiple where clause without Cassandra allow filtering

Hi, I am new to Cassandra.
We are working on an IoT project where car sensor data will be stored in Cassandra.
Here is an example of one table where I am going to store the data from one of the sensors.
This is some sample data.
I want to partition the data by organization_id so that each organization's data is kept in its own partition.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However, all my queries will look like the one below:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries on all the tables will have a similar or identical WHERE clause.
I am getting an error that asks me to add ALLOW FILTERING.
Kindly let me know how to partition the table and define the right primary key and indexes so that I don't have to add ALLOW FILTERING to the query.
Apologies for this basic question, but I'm just starting to use Cassandra (Apache Cassandra 3.11.12).
The WHERE clause has to follow the primary key you defined in your DDL: every partition key column must be restricted, and clustering columns can only be restricted in their declared order, so you cannot skip one part of the primary key and then filter on the next. Given the query pattern you have described, you can try the DDL below:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
The two keys above are different in Cassandra. In case 1, your data is partitioned by the combination of vin_number and organization_id, while last_updated acts as the clustering (ordering) key. In case 2, your data is partitioned only by vin_number, while organization_id and last_updated act as clustering keys. You need to figure out which case suits your use case.
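As a rough sketch of how the two layouts differ at query time (the timestamp literals are made-up values, and the table is assumed to have been created with one of the two keys above):

```cql
-- Works with either key layout: vin_number and organization_id are
-- restricted with "=", and last_updated is restricted with a range.
SELECT * FROM autonostix360.engine_speed
WHERE vin_number = 'xyz'
  AND organization_id = 1
  AND last_updated >= '2022-01-01 00:00:00+0000'
  AND last_updated <= '2022-01-31 23:59:59+0000';

-- Fits case 2 only: vin_number alone is the whole partition key there,
-- so all organizations for that VIN are read from one partition.
SELECT * FROM autonostix360.engine_speed
WHERE vin_number = 'xyz';
```

In case 1 the second query leaves part of the partition key unrestricted, so Cassandra cannot locate a single partition for it.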

viewing as list in cassandra

Table
CREATE TABLE vehicle_details (
owner_name text,
vehicle list<text>,
price float,
vehicle_type text,
PRIMARY KEY(price , vehicle_type)
)
I have two issues here:
I am trying to view the list of vehicles per user. If owner1 has 2 cars, it should show as owner_name1 vehicle1 and owner_name1 vehicle2. Is it possible to do this with a SELECT query?
The output I am expecting:
owner_name_1 | vehicle_1
owner_name_1 | vehicle_2
owner_name_2 | vehicle_1
owner_name_2 | vehicle_2
owner_name_2 | vehicle_3
I am trying to use owner_name in the primary key, but whenever I use WHERE, DISTINCT or ORDER BY it does not work properly. I am going to query by price and vehicle_type most of the time, but owner_name would be unique, hence I am trying to use it. I tried several combinations.
Below are the three combinations I tried:
PRIMARY KEY(owner_name, price, vehicle_type) WITH CLUSTERING ORDER BY (price)
PRIMARY KEY((owner_name, price), vehicle_type)
PRIMARY KEY((owner_name, vehicle_type), price) WITH CLUSTERING ORDER BY (price)
Queries I am running
SELECT owner_name, vprice, vehicle_type from vehicle_details WHERE vehicle_type='SUV';
SELECT Owner_name, vprice, vehicle_type from vehicle_details WHERE vehicle_type='SUV' ORDER BY price desc;
Since your table has:
PRIMARY KEY(price , vehicle_type)
you can only run queries with filters on the partition key (price) or the partition key + clustering column (price + vehicle_type):
SELECT ... FROM ... WHERE price = ?
SELECT ... FROM ... WHERE price = ? AND vehicle_type = ?
If you want to be able to query by owner name, you need to create a new table which is partitioned by owner_name. I also recommend not storing the vehicle in a collection:
CREATE TABLE vehicles_by_owner (
owner_name text,
vehicle text,
...
PRIMARY KEY (owner_name, vehicle)
)
By using vehicle as a clustering column, each owner will have rows of vehicles in the table. Cheers!
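To illustrate, here is a minimal sketch of such a table and the one-row-per-vehicle result it gives. The price and vehicle_type columns are assumptions taken from the question, not the full column list, and the inserted values are invented sample data.

```cql
-- Assumed column list: only owner_name and vehicle come from the answer.
CREATE TABLE vehicles_by_owner (
    owner_name text,
    vehicle text,
    vehicle_type text,
    price float,
    PRIMARY KEY (owner_name, vehicle)
);

-- Invented sample rows: one row per (owner, vehicle) pair.
INSERT INTO vehicles_by_owner (owner_name, vehicle, vehicle_type, price)
VALUES ('owner_name_1', 'vehicle_1', 'SUV', 25000.0);
INSERT INTO vehicles_by_owner (owner_name, vehicle, vehicle_type, price)
VALUES ('owner_name_1', 'vehicle_2', 'sedan', 18000.0);

-- Returns one row per vehicle of this owner, ordered by vehicle.
SELECT owner_name, vehicle FROM vehicles_by_owner
WHERE owner_name = 'owner_name_1';
```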

SASI indexes on year and month

I am new to SASI indexes in Cassandra and I am unclear how they behave when the WHERE predicate includes multiple indexed columns.
Here is one option I am looking at:
Option 1:
CREATE TABLE IF NOT EXISTS my_timeseries_data (
id text,
event_time timestamp,
value text,
year int,
month int,
PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
CREATE CUSTOM INDEX year_idx ON my_timeseries_data (year)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
CREATE CUSTOM INDEX month_idx ON my_timeseries_data (month)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
I expect to query like this sometimes:
select * from my_timeseries_data
where year = 2016 and month = 1 ALLOW FILTERING;
Does the SASI index on 'month' column help my performance?
Option 2:
Would it be better to index a concatenated column like 'year_and_month' below?
CREATE TABLE IF NOT EXISTS my_timeseries_data (
id text,
event_time timestamp,
value text,
year_and_month text,
PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
CREATE CUSTOM INDEX year_idx ON my_timeseries_data (year_and_month)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
And then query like this on a single SASI index:
select * from my_timeseries_data
where year_and_month = '2016_1';
Option 3:
No need for the extra month and year columns and SASI indexes, because having event_time as a clustering column already allows the scalable time-range queries that I want to do anyway?
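For reference, Option 3 would look roughly like the query below against the original table. This is a sketch: the id value and the timestamp bounds are made up, and the range only applies within the one partition selected by id.

```cql
-- All events for one id in January 2016, using only the primary key:
-- "=" on the partition key, a range on the clustering column.
SELECT * FROM my_timeseries_data
WHERE id = 'sensor-1'
  AND event_time >= '2016-01-01 00:00:00+0000'
  AND event_time <  '2016-02-01 00:00:00+0000';
```

Unlike the year/month queries in Options 1 and 2, this form does not cover a time range across all ids at once.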

Apache Cassandra table not sorting by name or title correctly

I have the following Apache Cassandra Table working.
CREATE TABLE user_songs (
member_id int,
song_id int,
title text,
timestamp timeuuid,
album_id int,
album_title text,
artist_names set<text>,
PRIMARY KEY ((member_id, song_id), title)
) WITH CLUSTERING ORDER BY (title ASC);
CREATE INDEX user_songs_member_id_idx ON music.user_songs (member_id);
When I run SELECT * FROM user_songs WHERE member_id = 1; I expected the clustering order by title to return the rows sorted by title in ascending order, but it doesn't.
Two questions:
Is there something wrong with the table in terms of ordering or the primary key?
Do I need more tables in order to get titles sorted for a given member_id?
Note: my Cassandra queries for this table are:
Find all songs with member_id
Remove a song from member_id given song_id
Hence the composite partition key.
UPDATE
It is similar to: Query results not ordered despite WITH CLUSTERING ORDER BY
However, one of the suggestions in the comments is to use member_id, song_id, title as the primary key instead of the composite partition key that I currently have. When I do that, it seems that I cannot delete with only song_id and member_id, which is the data I have when deleting (title is missing at that point).

Non-EQ relation error Cassandra - how fix primary key?

I created a table named posts. When I run this SELECT request:
return $this->db->query('SELECT * FROM "posts" WHERE "id" IN(:id) LIMIT '.$this->limit_per_page, ['id' => $id]);
I get this error:
PRIMARY KEY column "id" cannot be restricted (preceding column
"post_at" is either not restricted or by a non-EQ relation)
My table dump is:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id bigint,
name text,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY (user_id,post_at,id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
I read some articles about primary and clustering keys and understood that when there are several primary key columns, I need to use the = operator together with IN. In my case, I cannot use a single primary key column. What would you advise me to change in the table structure so that the error disappears?
My dummy table structure:
CREATE TABLE posts (
id timeuuid,
post_at timestamp,
user_id bigint,
PRIMARY KEY (id,post_at,user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
After inserting some dummy data,
I ran the query select * from posts where id in (timeuuid1,timeuuid2,timeuuid3);
I was using Cassandra 2.0 with CQL 3.0.
