Is it possible to shard in Vitess with a secondary sharding key - vitess

We are using Vitess to scale and achieve horizontal sharding in MySQL. Is it possible to do secondary sharding in Vitess?
For example:
Table 1 - Agency
(
    AgencyID INT,
    CreatedOn DATETIME
)
Table 2 - PayrollDetails
(
    AgencyID INT,  -- foreign key to the Agency table
    PayrollID INT,
    PayrollCreatedOn DATETIME
)
Now we have sharded both tables with AgencyID as the sharding key, but the PayrollDetails table is very large, with more than 100 million records. So we are planning to shard the PayrollDetails table again by the PayrollCreatedOn field: the primary shard key for both tables should remain the agency key, but PayrollDetails should be sharded by both AgencyID and PayrollCreatedOn. How can we achieve this in Vitess?

Conceptually, the sharding key (primary vindex) is used to decide which shard a row goes to. So it's not possible to have two sharding keys, because they would dictate conflicting locations for the row.
If I understand correctly, you want to query the table using PayrollCreatedOn in the WHERE clause. For that, you can create a secondary vindex. This creates a lookup table that points at where each row lives, and Vitess can exploit that. There's an explanation of this here: https://vitess.io/docs/reference/vindexes/. There is a new command called CreateLookupVindex that is capable of backfilling this lookup table. It's yet to be documented, though.
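To make the mechanics concrete, here is a minimal sketch of the kind of backing table a lookup vindex maintains (the table name and column sizes are hypothetical; when the vindex is owned by PayrollDetails, Vitess keeps this table up to date for you):
-- Hypothetical backing table for a lookup vindex on PayrollCreatedOn.
-- Each row maps a (PayrollCreatedOn, PayrollID) pair to the keyspace_id
-- derived from the primary vindex on AgencyID, which lets Vitess route
-- queries filtering on PayrollCreatedOn to the right shard.
CREATE TABLE payroll_created_lookup (
    PayrollCreatedOn DATETIME,
    PayrollID INT,
    keyspace_id VARBINARY(16),
    PRIMARY KEY (PayrollCreatedOn, PayrollID)
);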
Vitess also lets you "materialize" a table using a different primary vindex. In that case, the second table is a real-time copy of the first, but sharded differently. You can see a demo of this on the Vitess front page (scroll down to the video).

Related

Insert identical records into multiple tables with different primary keys

I have a few billion records with 15 fields each, which I want to insert into Cassandra (with the Java API). Since the search key of my queries can be any one of five different fields of the record (i.e. a search query on field 3, 7, 8, 13 or 14), I have created 5 identical tables with different primary keys in Cassandra (similar to the note mentioned in enter link description here).
Now I read a record (or a batch of records) and call "insert into Cassandra" 5 times.
I want to know: is there a mechanism in Cassandra that lets me call "insert into Cassandra" once and have the record(s) stored in the 5 tables automatically?
For example, could the record(s) be stored in the MemTable at once (from my code, by inserting once) and the Cassandra core then store them in 5 tables in SSTables?
Since Cassandra 3.0 there is support for materialized views, which could help you. But you need to design your source table carefully, as there are a number of limitations on how the structure of a materialized view can differ from the source table, most notably:
* you can add at most one column that isn't in the primary key of the source table to the view's primary key;
* the materialized view's primary key must contain all components of the source table's primary key, but you can use a different column order;
* all columns of the materialized view's primary key must be non-null.
You can find more details on these limitations in this blog post.
You also need to be careful when changing the partition key, so that you don't end up with very large partitions (though you may have the same problem if you write the data manually). Also, take into account that this adds load to the coordinator node, which must distribute the data to other servers when the partition key changes; when you write the data "manually", the driver sends each request directly to the replica that holds the data.
The syntax for creating materialized views is in the documentation; it is quite similar to SQL, but not identical (example from the documentation):
CREATE TABLE cyclist_mv (
    cid UUID PRIMARY KEY,
    name text,
    age int,
    birthday date,
    country text
);
CREATE MATERIALIZED VIEW cyclist_by_age
AS SELECT age, birthday, name, country
FROM cyclist_mv
WHERE age IS NOT NULL AND cid IS NOT NULL
PRIMARY KEY (age, cid);
In this case, we move from one column in the primary key (cid) to two columns in the primary key (age and cid). Note the explicit check for non-NULL values in the WHERE condition.
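Once the view exists, you can query it by its new partition key directly, for example:
SELECT name, country
FROM cyclist_by_age
WHERE age = 18;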

Creating secondary index on table in Cassandra

I have just started working on Cassandra.
I am a bit confused by the concept of a secondary key.
From the definition, what I understood is that an index on a non-key attribute of a table, which is not sorted, is a secondary index.
So I have this table
CREATE TABLE IF NOT EXISTS userschema.user (id int, name text, address text, company text, PRIMARY KEY (id, name))
So if I create an index like this
CREATE INDEX IF NOT EXISTS user_name_index ON userschema.user (name)
this should be a secondary index.
But my requirement is to create an index containing the columns name, id and company.
How can I create a secondary index like this in Cassandra?
I got this link, which describes something of this sort, but how are these secondary indexes? Aren't they just tables?
The user table above is just an example, not the actual one.
I am using Cassandra 3.0.9
id and name are already part of the primary key.
So the following queries will work:
SELECT * FROM table WHERE id=1
SELECT * FROM table WHERE id=1 and name='some value'
SELECT * FROM table WHERE name='some value' ALLOW FILTERING (this is inefficient)
You can create a secondary index on the company column:
CREATE INDEX IF NOT EXISTS company_index ON userschema.user (company)
Now, once the secondary index is defined, it can be used in the WHERE clause along with the primary key:
SELECT * FROM table WHERE id=1 and name='some value' and company='some value'
Though SELECT * FROM table WHERE company='some value' ALLOW FILTERING works, it will be highly inefficient.
Before creating a secondary index, have a look at When to use secondary index in cassandra.
The link you referred to mainly focuses on materialized views, where virtual tables are created to run queries on non-primary-key columns.
Moreover, it seems you are creating a secondary index on a primary-key column, which you have already defined in the creation of the table. Always remember that a secondary index should be on a non-primary-key column.
To get a clear idea about secondary indexes, refer to https://docs.datastax.com/en/cql/3.3/cql/cql_using/useSecondaryIndex.html
Now, the pros and cons of the alternative methods for a secondary index:
1. Materialized views:
A materialized view creates a new (virtual) table: you query the original table with its old primary key and the materialized view with its new primary key. Any data modification in the original table is reflected in the materialized view. If you drop the materialized view, its data becomes tombstones, whose gc_grace_seconds defaults to 864000 (10 days). Dropping the materialized view has no effect on the original table.
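As a sketch, a materialized view of the user table from the question that allows querying by company could look like this (the view name is hypothetical):
CREATE MATERIALIZED VIEW userschema.user_by_company
AS SELECT company, id, name, address
FROM userschema.user
WHERE company IS NOT NULL AND id IS NOT NULL AND name IS NOT NULL
PRIMARY KEY (company, id, name);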
2. ALLOW FILTERING:
It is highly inefficient and is not advised at all, as latencies will be high and performance will be degraded.
If you want more information, refer to this link too: How do secondary indexes work in Cassandra?
Correct me if I am wrong.

What are the pros and cons of grouping multiple tables together in Cassandra?

The problem is that Cassandra cannot handle a lot of tables per cluster (> 1000). I was looking for any means to reduce the number of tables, and one of them is grouping multiple tables that share the same structure together.
Let's say we have two tables, A and B:
create table A (
key text,
value text,
primary key(key)
)
and
create table B (
key text,
value text,
primary key(key)
)
We can group them together by adding one more partition key
create table Shared (
original_table_name text, // either 'A' or 'B'
key text,
value text,
primary key(original_table_name, key)
)
My question is: is this a good pattern, and what are the consequences of modelling data this way?
Please elaborate on what you mean by a lot of tables, because our production is running with 50+ tables and I don't see any issue with it.
Anyway, if your application is using a lot of tables, the most probable cause is normalized tables. In Cassandra you should always create denormalized tables, because there is no join facility. Cassandra is built for very fast writes, so you can count on that and not worry.
Now, regarding the new design, I don't see any problem with it; the only thing is that your partition key should be the combination (table_name, key) and not just table_name, so that data is evenly distributed across the nodes.
And of course, to query, you will have to specify table_name + key each time.
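Concretely, that suggestion changes the schema above to use a composite partition key. A minimal sketch, reusing the Shared table from the question:
create table Shared (
    original_table_name text,
    key text,
    value text,
    primary key((original_table_name, key))
)
-- every query must then supply both partition-key columns:
select value from Shared where original_table_name = 'A' and key = 'some-key';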

How to add multiple columns as primary keys in Cassandra?

I have an existing table with millions of records. Initially we had two columns as the partitioning key and clustering key, and now I want to add two more columns to the table as partitioning keys.
How?
If you make a change to the partition key, you will need to create a new table and import the existing data. This is due, in part, to the fact that a partition key is not the same as a primary key in a relational database. The partition key is hashed by Cassandra, and that hash is used to find partitions on disk. If you change the partition key, you change the hash value and can no longer look up the partition!
CREATE TABLE KEYSPACE_NAME.AMAR_EXAMPLE (
    COLUMN_1 TYPE,
    COLUMN_2 TYPE,
    COLUMN_3 TYPE,
    ...
    COLUMN_N TYPE,
    // Here we declare the partition key columns and clustering columns
    PRIMARY KEY ((COLUMN_1, COLUMN_2, COLUMN_3, COLUMN_4), CLUSTERING_COLUMN)
)
// If you need to change the default clustering order, declare that here
// (clustering order can only reference clustering columns, not partition key columns)
WITH CLUSTERING ORDER BY (CLUSTERING_COLUMN DESC);
You could export the data to CSV using COPY and then import it into the new table via COPY, or use SSTABLELOADER. There is plenty of documentation, and there are walkthroughs on how to use those tools; for example, this Datastax blog post talks about the changes made to the updated SSTABLELOADER. If you create a new table and import the existing data, you will create new partitions and new hashes. Cassandra will not let you simply add additional columns to the partition key after the table has been created.
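As a sketch of that round trip in cqlsh (assuming the old and new tables share the same column names; the old table name and file path are hypothetical):
COPY KEYSPACE_NAME.OLD_TABLE TO '/tmp/old_table.csv' WITH HEADER = true;
COPY KEYSPACE_NAME.AMAR_EXAMPLE FROM '/tmp/old_table.csv' WITH HEADER = true;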
Understanding your data and Cassandra data modeling techniques will help mitigate the amount of work you may find yourself doing when changing partition keys. Check out the self-paced courses provided by Datastax; DS220: Data Modeling could really help.

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement:
I would like to update X rows that match some condition on a field that is not the primary key and is not unique.
For example, if a User column family has ID, name and birthday columns, I would like to update all the users that were born after some specific day.
Even if I add 'birthday' to the primary key (let's say ('ID', 'birthday')), I cannot perform this query, because part of the primary key is missing.
How can I approach this by designing my column family differently?
Thanks.
According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This was done not by accident but by design: such a feature (e.g. update users set status=1 where id>10) would allow a user to update all the data in a table at once, which can be very expensive on large databases. Cassandra explicitly forbids all operations requiring data scans across multiple partitions.
To update multiple users at once, you have to know their IDs. Having a table defined as:
CREATE TABLE stackoverflow.users (
id timeuuid PRIMARY KEY,
dob timestamp,
status text
)
and knowing each user's primary key, you can run queries like update users set status='foo' where id in (1,2,3,4). But queries with really large sets of keys inside an IN statement may cause performance issues on C*.
But how can you have an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is really ideal:
Create an index table like:
CREATE TABLE stackoverflow.dob_index (
year int,
dob timestamp,
ids list<timeuuid>,
PRIMARY KEY (year, dob)
)
with a compound partition+clustering primary key, and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch ids for different years. Notice that I've defined multiple partitions for the table to get a reasonably even partition distribution in the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones, if there's a choice.
Have a separate stand-alone index available for complex queries (like ElasticSearch/Solr/Sphinx).
But I suggest you revisit your application logic so as to avoid updating/deleting data at all:
instead of updating the users table directly, you can have a separate table, user_statuses, into which you insert new statuses:
CREATE TABLE user_statuses (
id timeuuid,
updated_at timestamp,
status text,
PRIMARY KEY (id, updated_at)
)
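As a usage sketch (where <user_id> stands for an existing user's timeuuid), the newest status is then read back in reverse clustering order:
-- append a new status instead of updating in place
INSERT INTO user_statuses (id, updated_at, status)
VALUES (<user_id>, toTimestamp(now()), 'active');
-- fetch the latest status for that user
SELECT status FROM user_statuses
WHERE id = <user_id>
ORDER BY updated_at DESC LIMIT 1;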
When you need to scan/update a lot of rows at once, prefer tools like Spark to efficiently distribute the workload among your cluster nodes.
