I have just started working on Cassandra.
I am a bit confused by the concept of a secondary index.
From the definition, what I understood is that an index on a non-key attribute of a table, which is not sorted, is a secondary index.
So I have this table
CREATE TABLE IF NOT EXISTS userschema.user (id int,name text, address text, company text, PRIMARY KEY (id, name))
So If I create index like this
CREATE INDEX IF NOT EXISTS user_name_index ON userschema.user (name)
this should be secondary index.
But my requirement is to create index containing columns name , id , company.
How can I create a secondary index like this in Cassandra ?
I got this link which describes something of this sort, but how are these secondary indexes? Aren't they just tables?
The user table above is just an example, not the actual one.
I am using Cassandra 3.0.9
id and name are already part of the primary key.
So the following queries will work:
SELECT * FROM table WHERE id=1
SELECT * FROM table WHERE id=1 and name='some value'
SELECT * FROM table WHERE name='some value' ALLOW FILTERING (this is inefficient)
You can create a secondary index on the company column:
CREATE INDEX IF NOT EXISTS company_index ON userschema.user (company)
Once the secondary index is defined, it can be used in the WHERE clause along with the primary key:
SELECT * FROM table WHERE id=1 and name='some value' and company='some value'
Though SELECT * FROM table WHERE company='some value' ALLOW FILTERING works, it will be highly inefficient.
Before creating a secondary index, have a look at When to use secondary index in cassandra.
The link you have referred to mainly focuses on materialized views, in which we create virtual tables to execute queries on non-primary-key columns.
Moreover, it seems you are creating a secondary index on a column that is already part of the primary key you defined when creating the table. Always remember that a secondary index should be on a non-primary-key column.
To get a clear idea about secondary indexes, refer to this: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
Now, the pros and cons of the alternative methods for the secondary index:
1. Materialized views:
A materialized view creates a new, virtual table: you run your queries against it using its own primary key, while the original table keeps its original primary key. Any data modification in the original table is reflected in the materialized view. If you drop the materialized view, the data is turned into tombstones, whose gc_grace_seconds defaults to 864000 (10 days). Dropping the materialized view has no effect on the original table. A minimal sketch is shown after this list.
2. ALLOW FILTERING:
It is highly inefficient, and using ALLOW FILTERING is not advised at all, as latencies will be high and performance will degrade.
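As a rough illustration of the materialized-view option above (the view name user_by_company is made up, and it assumes the userschema.user table from the question), a view keyed by company could look like this:
CREATE MATERIALIZED VIEW IF NOT EXISTS userschema.user_by_company AS
    SELECT company, id, name, address
    FROM userschema.user
    WHERE company IS NOT NULL AND id IS NOT NULL AND name IS NOT NULL
    PRIMARY KEY (company, id, name);

-- Queries by company then go against the view, not the base table
SELECT * FROM userschema.user_by_company WHERE company = 'some value';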
If you want much more information, refer to this link too: How do secondary indexes work in Cassandra?
Correct me if I am wrong
Related
I want to use the IN clause on a non-primary-key column in Cassandra. Is it possible? If it is not, is there any alternative or suggestion?
Three possible solutions
Create a secondary index. This is not recommended due to performance problems.
See if you can designate that column in the existing table as part of the primary key.
Create another denormalised table that is optimised for your query, i.e. data model by query pattern (see the sketch after the update below).
Update:
Also, even after you move that column into the primary key, operations with the IN clause can be further optimised. I found this cassandra lookup by list of primary keys in java very useful.
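For the third option above, here is a minimal sketch, assuming a hypothetical user_by_company table (column names borrowed from the user table earlier in the thread); once the column you want to filter on becomes the partition key, IN turns into a plain multi-partition lookup:
-- Hypothetical denormalised table keyed by the column used with IN
CREATE TABLE IF NOT EXISTS userschema.user_by_company (
    company text,
    id int,
    name text,
    PRIMARY KEY (company, id)
);

-- IN on the partition key is a direct lookup per listed value
SELECT * FROM userschema.user_by_company WHERE company IN ('acme', 'globex');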
We are using the Vitess database to scale and achieve horizontal sharding in MySQL. Is it possible to do secondary sharding in Vitess?
For example:
Table 1 - Agency
(
AgencyID INT,
CreatedOn DATETIME
)
Table 2 - PayrollDetails
(
AgencyID INT FOREIGN KEY TO Agency Table,
PayrollID INT,
PayrollCreatedOn DATETIME
)
Now we have sharded both tables with AgencyID as the sharding key, but the PayrollDetails table is huge, with more than 100 million records. So we are planning to shard the PayrollDetails table again by the PayrollCreatedOn field: the primary shard for both tables should stay on the agency key, but PayrollDetails should be sharded by both AgencyID and PayrollCreatedOn. How can we achieve this in Vitess?
Conceptually, the sharding key (primary vindex) is used to decide which shard a row goes to. So, it's not possible to have two sharding keys because they would dictate conflicting locations for the row.
If I understand correctly, you want to query the table using PayrollCreatedOn in the WHERE clause. For that, you can create a secondary Vindex. This creates a lookup table that points at where each row lives, and Vitess can exploit that. There's an explanation of this here: https://vitess.io/docs/reference/vindexes/. There is also a new command called CreateLookupVindex that is capable of backfilling this lookup table. It's yet to be documented, though.
Vitess also lets you "materialize" a table by using a different primary vindex. In that case, the second table will be a real-time copy of the first table, but sharded differently. You can see a demo for this on the vitess front page (scroll down to the video).
I have a few billion records with 15 fields, which I want to insert into Cassandra (with the Java API). Since the search key of my queries can be one of five different fields of the record (i.e. a search query on field 3, 7, 8, 13, or 14), I have created 5 identical tables with different primary keys in Cassandra (similar to the note mentioned in enter link description here).
Now I read a record (or a batch of records) and call "insert into Cassandra" 5 times.
I want to know: is there a mechanism in Cassandra that lets me call "insert into Cassandra" once and have the record(s) stored in the 5 tables automatically?
For example, could the record(s) be stored in the MemTable once (from my code, inserting once) and the Cassandra core store them in the 5 tables in SSTables?
Since Cassandra 3.0 there is support for materialized views, which could help you. But you need to design your source table carefully, as there are a number of limitations on how the structure of a materialized view may differ from the source table, most notably:
* you can add at most one column to the primary key that isn't in the primary key of the source table;
* the materialized view's primary key must contain all components of the source table's primary key, but you can use a different column order;
* all columns of the materialized view's primary key must be non-null.
You can find more details on these limitations in this blog post.
You also need to be careful when changing the partition key so that you don't end up with big partitions (although you may have the same problem if you write the data manually). Also, take into account that this adds load on the coordinator node, which has to distribute the data to other servers when the partition key changes; when you write the data "manually", the driver sends the request directly to the replica that holds that data.
The syntax for creating materialized views is in the documentation; it is quite similar to SQL's, but not exactly the same (example from the documentation):
CREATE TABLE cyclist_mv (cid UUID PRIMARY KEY,
name text, age int, birthday date, country text);
CREATE MATERIALIZED VIEW cyclist_by_age
AS SELECT age, birthday, name, country
FROM cyclist_mv
WHERE age IS NOT NULL AND cid IS NOT NULL
PRIMARY KEY (age, cid);
In this case, we move from one column in the primary key (cid) to 2 columns in the primary key (age and cid). Note the explicit check for non-NULL values in the WHERE condition.
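A quick usage sketch (the values below are made up): writes go to the base table as usual, and reads by age go to the view.
-- Write to the base table; the view is maintained automatically
INSERT INTO cyclist_mv (cid, name, age, birthday, country)
    VALUES (uuid(), 'Anna', 29, '1990-05-27', 'NL');

-- Query the view by its own partition key
SELECT name, country FROM cyclist_by_age WHERE age = 29;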
The schema I am using is as follows:
CREATE TABLE mytable(
id int,
name varchar,
PRIMARY KEY ((id),name)
) WITH CLUSTERING ORDER BY (name desc);
I wanted to delete records with the following command:
DELETE FROM mytable WHERE name = 'Jhon';
But it gave this error:
[Invalid query] message="Some partition key parts are missing: name"
As I looked for the reason, I came to know that a delete is not possible with only clustering columns.
Then I tried
DELETE FROM mytable WHERE id IN (SELECT id FROM mytable WHERE name='Jhon') AND name = 'Jhon';
But obviously it did not work.
I then tried setting a TTL of 0 to delete the row. But TTL can be set only on a particular column, not the entire row.
What are feasible alternatives to perform this operation?
In Cassandra, you need to design your data model to support your queries. When you query your data, you always have to provide the partition key (otherwise the query would be inefficient).
The problem is that you want to query your data without a partition key. You would need to denormalize your data to support this kind of request. For example, you could add an additional table, such as:
CREATE TABLE id_by_name(
name varchar,
id int,
PRIMARY KEY (name, id)
) WITH CLUSTERING ORDER BY (id desc);
Then, you would be able to do your delete with a few queries:
SELECT ID from id_by_name WHERE name='John';
let's assume this returns 4.
DELETE FROM mytable WHERE id=4;
DELETE FROM id_by_name WHERE name='John' and id=4;
You could try to leverage a materialized view (instead of maintaining id_by_name yourself), but materialized views are currently marked as unstable.
Now, there are still a few issues you need to address in your data model, in particular how you handle multiple users with the same name, etc.
You cannot delete by an incomplete primary key. Primary key decisions drive sharding and load balancing, and Cassandra can get complex if you are not used to thinking in columns.
I don't like the above answer; though it is good, it complicates your solution. If you are thinking relationally but getting lost in Cassandra, I suggest using something that simplifies and maps your thinking to relational views.
I want to filter on a table that has a partition key and a clustering key, with another criterion on a regular column. I got the following warning:
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean,
PRIMARY KEY((name),id)
)
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra, you can't filter data on a non-primary-key column unless you create an index on it.
In Cassandra 3.0 and up, you are allowed to filter data on a non-primary-key column, but with unpredictable performance.
In Cassandra 3.0 and up, if you provide the full primary key (as in your query), then you can use the query with ALLOW FILTERING and ignore the warning.
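For example (a sketch only; the uuid is a sample value standing in for 'myid'), the query from the question should run once ALLOW FILTERING is appended:
SELECT id FROM mytable
WHERE name = 'myname'
  AND id = 123e4567-e89b-12d3-a456-426614174000
  AND deleted = false
ALLOW FILTERING;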
Otherwise, filter on the client side, or remove the deleted field and create another table:
Instead of updating the field to deleted = true, move your data to another table, let's say mytable_deleted:
CREATE TABLE mytable_deleted (
name text,
id uuid,
PRIMARY KEY (name, id)
);
Now you have only the non-deleted data in mytable and the deleted data in the mytable_deleted table (a sketch of the move is shown below).
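A minimal sketch of that move, using a logged batch so the insert and the delete are applied together (the values are placeholders):
BEGIN BATCH
    -- copy the row into the "deleted" table
    INSERT INTO mytable_deleted (name, id) VALUES ('myname', 123e4567-e89b-12d3-a456-426614174000);
    -- then remove it from the live table
    DELETE FROM mytable WHERE name = 'myname' AND id = 123e4567-e89b-12d3-a456-426614174000;
APPLY BATCH;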
or
Create an index on it:
The column deleted is a low-cardinality column, so remember:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
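For completeness, this is roughly what the index and the query would look like (the index name is made up); given the low cardinality of deleted, this is usually the weakest of the options above:
CREATE INDEX IF NOT EXISTS mytable_deleted_idx ON mytable (deleted);

-- With the partition key restricted, the index only has to scan one partition
SELECT id FROM mytable WHERE name = 'myname' AND deleted = false;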
Read More : When not to use an index