Query I would like to fire
select * from t1 where c1 > 1000 and c2 > 1000000 and c3 > 8000000
Data model of table t1
create table t1 (
c1 int,
c2 int,
c3 int,
c4 text
)
Which columns should I use as the partition key and which as clustering keys?
c1, c2, and c3 can each have values between 1 and 10 million.
If I define PRIMARY KEY ((c1, c2, c3)) then the values will be spread across the cluster. But since I fire > queries on the c1, c2, c3 columns, how does Cassandra know which nodes to contact, or does it do a full scan of the cluster?
Cassandra won't allow you to run that query without ALLOW FILTERING, which lets it read the entire dataset spread across the cluster. It would read everything and throw away the rows that don't match. It's highly recommended never to use ALLOW FILTERING outside dev/test unless you're really sure what you're doing.
Partition keys can only be filtered with equalities, not inequalities such as the ones you have. Inequalities can only be used with clustering keys.
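For illustration (a hedged note: filtering on partition key columns with ALLOW FILTERING requires Cassandra 3.6 or later), the original query only runs in this form, and it scans every node:
select * from t1 where c1 > 1000 and c2 > 1000000 and c3 > 8000000 allow filtering; // reads everything, discards non-matching rows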
If your table does not have that many rows, you can use the bucket strategy. With it, you create an auxiliary column to act as the sole partition key, holding a predefined value (such as 1).
create table t1 (
bucket int,
c1 int,
c2 int,
c3 int,
c4 text,
PRIMARY KEY (bucket, c1, c2, c3)
)
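A hedged usage sketch: with this layout, a range on the first clustering column is a valid single-partition query, while inequalities on all three columns at once still require ALLOW FILTERING, because only the last restricted clustering column may carry an inequality (at least here the filtering stays inside the single partition):
// valid as-is: one partition, inequality on the first clustering column
select * from t1 where bucket = 1 and c1 > 1000;
// ranges on c2 and c3 as well still need ALLOW FILTERING
select * from t1 where bucket = 1 and c1 > 1000 and c2 > 1000000 and c3 > 8000000 allow filtering;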
Because everything lives in a single partition, this strategy is not adequate for scaling to tables with many rows.
If you do have many rows that you need to partition, then you have to rethink your strategy and consider:
Finding some kind of key (or keys) in the data that can partition the data and at the same time help filter it when needed. You would then use it as the partition key in the example above. Denormalizing the data may help surface such a key (e.g. creating a Status column holding Low/Medium/High, which you could then combine with inequality filtering on the clustering keys).
Planning a table (or tables) to be queried by an analytics framework such as Spark. In analytics it is common to need to query by any column, with equalities or inequalities.
Related
I know that in a Cassandra table, inserts with the same partition key will overwrite the previous value. So, if we insert 10 records with the same primary key, it will do the same, meaning it overwrites and stores only the 10th value. Right?
So, I have the below table in my Cassandra database which has ~1 billion rows with ~4800 partition keys:
CREATE TABLE tb(
parkey varchar, //this is a UUID converted to String.
pk1 text,
pk2 float,
pk3 float,
pk4 float,
pk5 text,
pk6 text,
pk7 text,
PRIMARY KEY ((parkey),pk1, pk2, pk3, pk4, pk5, pk6, pk7));
This means I have ~1 billion primary keys!! I have such a big primary key because every record is unique only across all of its values. However, I have a feeling this might not be the best table schema: it takes 5 minutes for Spark to query all this data, and it hangs for another 10 minutes just before unpersisting a table from memory, for reasons I do not understand!
Should I break down and denormalize the table somehow according to the queries being used? Will that improve the query times? My thought is that even if I break down the table, I will still have ~1 billion primary keys in each denormalized table that gets created. Would that be efficient? Will it not take 15 minutes again to query the newly created tables?
Edit 1
I am always using one query that selects by partition key, hence one table. Would this improve the times?
CREATE TABLE tb(
parkey varchar, //this is a UUID converted to String.
pk1 varchar, //also a UUID but completely unique for every record
c1 text,
c2 float,
c3 float,
c4 float,
c5 text,
c6 text,
c7 text,
PRIMARY KEY ((parkey),pk1));
The quick answer is YES, you should denormalise the data and always start with the app queries. Those who come from a relational DB background tend to focus on how the data is stored (table schema) instead of listing all the app queries first.
By focusing on the app queries first THEN designing a table for each of the queries, the table is optimised for reads. If you try to adapt an app query to an existing table then the table will never be optimised and the queries will almost always be slow.
As a side note, the long answer is that 1B rows != 1B partitions in the schema you posted. The table definition does not have a 1:1 mapping between rows and partitions. Each of the partitions in your table can have ONE OR MORE rows. Cheers!
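As an illustrative sketch only (the real design depends on your actual app queries; the table and query here are assumptions, not a recommendation for your data): if one app query were "fetch a partition's rows filtered by c2", a table shaped for that query could look like:
// hypothetical query-first table; names are illustrative
CREATE TABLE tb_by_parkey_c2 (
    parkey varchar,
    c2 float,
    pk1 varchar,
    c1 text,
    PRIMARY KEY ((parkey), c2, pk1)
);
// served directly by the clustering order, no filtering needed:
// SELECT * FROM tb_by_parkey_c2 WHERE parkey = ? AND c2 > ?;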
create table t (id int primary key, c1 int, c2 int, ....c1000 int);
I know that an RDBMS stores all the columns of such a row contiguously. If I want to query c400 of only one row, I can locate the row fast, but I have to read out the whole row and then find the c400 value; so when I want to query all the c400 values of the table, all the columns of all the rows will be read.
If I create the same table in Cassandra and specify the column "id" as the partition key, I know c1, c2, ..., c1000 in one partition will be stored like below:
{c1: 1, c2: 123, c3: 45, ....., c1000: 10}
That is a key-value store structure, but the columns are still stored contiguously. When I want to read column c400 of one partition, how can Cassandra read it fast without scanning the other columns?
It scans the columns; that's why it is recommended to avoid wide rows. The wider your rows, the slower your reads. Wide rows also put more pressure on the heap.
I need to select the Nth row from a Cassandra table based on a number produced by my logic, i.e. if the logic outputs 23, I need to get the 23rd row's details. Since there is no auto-increment in Cassandra, I can't go with an ID key match. In SQL this is done using OFFSET and LIMIT; I don't know how to achieve the same feat in Cassandra.
Can we achieve this using any UDF concept? Thanks in advance.
Table Schema :
CREATE TABLE new_card (
customer_id bigint,
card_number text,
active tinyint,
auto_pay int,
available_credit_limit double,
average_card_spend_half_yearly double,
average_card_spend_monthly double,
average_card_spend_quarterly double,
average_card_spend_yearly double,
avg_half_yearly_spend_mcc double,
PRIMARY KEY (customer_id, card_number)
);
If you are using the Java driver, refer to Paging.
Note that Cassandra does not support direct offsets; pages are read sequentially. If your queries require offsets, you might want to revisit your data model. You could create a composite partition key that includes the row number as an additional column on top of your existing partition key column.
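A hedged sketch of that idea (row_num is a hypothetical, application-maintained sequence number; Cassandra will not assign it for you):
CREATE TABLE new_card_by_row (
    customer_id bigint,
    row_num int,
    card_number text,
    PRIMARY KEY ((customer_id, row_num))
);
// row N becomes directly addressable:
// SELECT * FROM new_card_by_row WHERE customer_id = ? AND row_num = 23;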
You simply can't select the Nth row from a table, because a Cassandra table is made of partitions, and you can order the rows within a partition but not the partitions themselves. Paging will go through all the rows, but there will be no chronological order of the rows selected using such an approach (disregarding the fact that the partitions can change while you page through them).
If you want to select row number N from Cassandra, you need to implement an auto-increment field at the application level and use it as the key.
There are ways to do that with Cassandra, for example using lightweight transactions, but they have a high cost from a performance perspective. See several solutions here:
How to create auto increment IDs in Cassandra
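For completeness, a hedged sketch of the lightweight-transaction approach mentioned above (table and column names are illustrative; every increment costs a Paxos round, so it is slow under load):
CREATE TABLE id_counter (name text PRIMARY KEY, next_id bigint);
INSERT INTO id_counter (name, next_id) VALUES ('card_rows', 1) IF NOT EXISTS;
// read next_id, then claim it; retry whenever [applied] comes back false
UPDATE id_counter SET next_id = 24 WHERE name = 'card_rows' IF next_id = 23;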
I have a Cassandra table that is created like:
CREATE TABLE table(
num int,
part_key int,
val1 int,
val2 float,
val3 text,
...,
PRIMARY KEY((part_key), num)
);
part_key is 1 for every record, because I want to execute range queries and I only have one server (I know that's not a good use case). num is the record number from 1 to 1,000,000. I can already run queries like
SELECT num, val43 FROM table WHERE part_key=1 and num<5000;
Is it possible to do some more filtering in Cassandra, like:
... AND val45>463;
I think it's not possible like that, but can somebody explain why?
Right now I do this filtering in my code, but are there other possibilities?
I hope I did not miss a post that already explains this.
Thank you for your help!
Cassandra range queries are only possible on the last clustering column specified by the query. So, if your pk is (a,b,c,d), you can do
... where a=2 and b=4 and c>5
... where a=2 and b>4
but not
... where a=2 and c>5
This is because data is stored in partitions, indexed by the partition key (the first key of the primary key), and then sorted by each successive clustering key.
If you have exact values, you can add a secondary index to val4 and then do
... and val4=34
but that's about it. And even then, you want to hit a partition before applying the index. Otherwise you'll get a cluster-wide query that will likely time out.
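A hedged sketch of that suggestion, reusing the asker's schema (note that table is a reserved word in CQL, so a real table would need a different name):
CREATE INDEX ON table (val4);
// partition-restricted lookup through the index:
SELECT num, val4 FROM table WHERE part_key = 1 AND val4 = 34;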
The querying limitations are there due to the way Cassandra stores data for fast insert and retrieval. All data in a partition is held together, so filtering inside the partition client-side is usually not a problem, unless you have very large wide rows (in which case, perhaps the schema should be reviewed).
Hope that helps.
I'm undecided whether it's better, performance-wise, to use a very commonly shared column value (like Country) as partition key for a compound primary key or a rather unique column value (like Last_Name).
Looking at Cassandra 1.2's documentation about indexes I get this:
"When to use an index:
Cassandra's built-in indexes are best on a table
having many rows that contain the indexed value. The more unique
values that exist in a particular column, the more overhead you will
have, on average, to query and maintain the index. For example,
suppose you had a user table with a billion users and wanted to look
up users by the state they lived in. Many users will share the same
column value for state (such as CA, NY, TX, etc.). This would be a
good candidate for an index."
"When not to use an index:
Do not use an index to query a huge volume of records for a small
number of results. For example, if you create an index on a column
that has many distinct values, a query between the fields will incur
many seeks for very few results. In the table with a billion users,
looking up users by their email address (a value that is typically
unique for each user) instead of by their state, is likely to be very
inefficient. It would probably be more efficient to manually maintain
the table as a form of an index instead of using the Cassandra
built-in index. For columns containing unique data, it is sometimes
fine performance-wise to use an index for convenience, as long as the
query volume to the table having an indexed column is moderate and not
under constant load."
Looking at the examples from CQL's SELECT for "Querying compound primary keys and sorting results", I see something like a UUID being used as the partition key... which would indicate that it's preferable to use something rather unique?
The indexing discussed in the documentation you quoted refers to secondary indexes. In Cassandra there is a difference between the primary and secondary indexes. For a secondary index it would indeed be bad to have highly unique values; however, for the components of a primary key this depends on which component we are focusing on. In the primary key we have these components:
PRIMARY KEY(partitioning key, clustering key_1 ... clustering key_n)
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. data well distributed across each node) then you want your partitioning key to be as random as possible. That is why the example you saw uses UUIDs.
The clustering key is used for ordering, so that querying columns with a particular clustering key can be more efficient. That is where you want your values to not be unique, and where there would be a performance hit if unique rows were frequent.
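To connect this back to the question, a hedged sketch (table and column names are illustrative): Country works well as a partition key only in a table built specifically for the country lookup, while a unique value such as a UUID is what keeps a general-purpose table evenly distributed:
CREATE TABLE users_by_country (
    country text,
    last_name text,
    user_id uuid,
    PRIMARY KEY ((country), last_name, user_id)
);
// one-partition read, rows pre-sorted by last_name:
SELECT * FROM users_by_country WHERE country = 'CA';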
The CQL docs have a good explanation of what is going on.
If you use CQL3, given a column family:
CREATE TABLE table1 (
a1 text,
a2 text,
b1 text,
b2 text,
c1 text,
c2 text,
PRIMARY KEY ((a1, a2), b1, b2)
);
by defining a
primary key ( (a1, a2, ...), b1, b2, ... )
This implies that:
a1, a2, ... are fields used to craft a row key in order to:
determine how the data is partitioned
determine what is physically stored in a single row
referred to as the row key or partition key
b1, b2, ... are column family fields used to cluster within a row key in order to:
create logical sets inside a single row
allow more flexible search schemes such as range queries
referred to as the column key or clustering key
All the remaining fields are effectively multiplexed/duplicated for every possible combination of column keys. Below is an example of how composite keys, with partition keys and clustering keys, work.
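A hedged illustration of that duplication, using table1 above (values are made up): two inserts that share the partition key (a1, a2) but differ in clustering keys become two row entries inside the same partition, each carrying its own copy of the remaining fields:
INSERT INTO table1 (a1, a2, b1, b2, c1) VALUES ('x', 'y', 'b1a', 'b2a', 'v1');
INSERT INTO table1 (a1, a2, b1, b2, c1) VALUES ('x', 'y', 'b1b', 'b2b', 'v2');
// both land in partition ('x', 'y'); c1 is stored once per (b1, b2) combination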
If you want to use range queries, you can use secondary indexes or (starting from CQL3) you can declare those fields as clustering keys. Declaring them as clustering keys creates a single wide row, which has an impact on speed when you fetch many clustering key values, as in:
select * from accounts where Country>'Italy' and Country<'Spain'
I am sure you have already got the answer, but this may still help you understand better.
CREATE TABLE table1 (
a1 text,
a2 text,
b1 text,
b2 text,
c1 text,
c2 text,
PRIMARY KEY ((a1, a2), b1, b2)
);
Here the partition keys are (a1, a2) and the row (clustering) keys are b1, b2.
The combination of partition keys and row keys must be unique for each new record.
The above primary key can be modeled like this:
Node<key, value>
Node<(a1a2), Map<b1b2, otherColumnValues>>
As we know, the partition key is responsible for data distribution across your nodes.
So if you insert 100 records into table1 with the same partition keys but different row keys, it will store the data on the same node but in different columns.
Logically we can represent it like this:
Node<(a1a2), Map<string1, otherColumnValues>, Map<string2, otherColumnValues>, ..., Map<string100, otherColumnValues>>
So the records are stored sequentially in memory.