I have to create a table that stores a large amount of data (around 400 columns and 5,000,000 to 40,000,000 rows). There is a column "counter" that counts from 1 upwards; right now it is my primary key. The other columns are of int, float, and varchar types, and these types repeat across the columns.
I need to do this for a database comparison, so I have to use Cassandra, even if other databases could handle this specific problem better.
On this table I want to execute some range queries. The queries should be like:
SELECT counter, val1, val2, val3 FROM table WHERE counter > 1000 AND counter < 5000;
Also, there will be other filter parameters:
... AND val54 = 'OK';
I think this is a problem in Cassandra, because "counter" is the PK. I will try using the token() function, but I guess this will be slow.
Right now I am learning about data modeling in Cassandra, but I hope somebody with Cassandra experience has some hints for me on how to organize the table and make the queries possible and fast. Even just some topics I should learn about, or links, would help.
Have a nice day,
Friedrich
This sounds like a bad use case for Cassandra.
First, range queries are discouraged in Cassandra. This is because the range can't be resolved without visiting every node in the cluster.
Second, you can't mix the counter type with other column types: apart from the primary key, a given table can have either only counter columns or only non-counter columns.
As far as Cassandra data modeling goes, if you want to create a successful data model, build your partitions around the exact thing you're going to query.
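That said, since you're stuck with Cassandra for the comparison, one common pattern for range queries is to bucket the counter into fixed-size partitions, so a range maps to a small, known set of partitions. A minimal sketch, assuming a bucket size of 10,000 and showing only a few of the 400 columns (the table name, bucket size, and column subset are my own assumptions):

CREATE TABLE measurements (
    bucket int,      -- computed by the client, e.g. counter / 10000
    counter int,
    val1 int,
    val2 float,
    val3 text,
    PRIMARY KEY (bucket, counter)
);

-- the range 1000..5000 falls entirely inside bucket 0:
SELECT counter, val1, val2, val3 FROM measurements
WHERE bucket = 0 AND counter > 1000 AND counter < 5000;

A range that spans several buckets becomes a handful of single-partition queries issued by the client.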
I read in the Cassandra documentation that creating a secondary index is less efficient because, in the worst case, it needs to touch all nodes in order to find the data for that non-key column.
But my question is: even if we do not create a secondary index, won't Cassandra also have to touch all nodes (in the worst case) to find where the rows with that non-key column value reside?
Note: yes, I understand that if the cardinality is high, the secondary index will contain (store) an entry for almost every row, and in that way it is bad in terms of storage. But I want to know how not creating a secondary index is more efficient than creating one.
Secondary indexes should be used only in specific cases, such as when you use them together with a condition on the partition key column, when your data has suitable cardinality, etc.
For example, if we have the following table:
create table test.test (
pk int,
c1 int,
val1 int,
val2 int,
primary key(pk, c1));
and you create a secondary index on the column val2, then the following query will be very efficient:
select * from test.test where pk = 123 and val2 = 10;
because you restrict the execution of the query only to the nodes that are replicas for the pk value 123.
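For reference, creating that index could look like this (the index name is my own):

create index val2_idx on test.test (val2);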
But if you do
select * from test.test where val2 = 10;
then Cassandra will need to go to every node and ask for the data there. This will be much slower, and it puts pressure on the coordinator node.
Standard secondary indexes have other limitations as well, such as supporting searches only for exact values, and problems when a column has very low or very high cardinality. SASI indexes are better from a design standpoint, although they are still experimental and have implementation problems.
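For completeness, a SASI index is created as a custom index; a minimal sketch against the table above (the index name is mine):

CREATE CUSTOM INDEX val1_sasi ON test.test (val1)
USING 'org.apache.cassandra.index.sasi.SASIIndex';

Unlike a standard secondary index, this also allows inequality predicates, such as val1 > 10.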
You can find technical details about implementation of secondary indexes in the following blog post.
DataStax has other implementations in its commercial offering:
DSE Search, which is based on Apache Solr, so you get a lot of flexibility (full-text search, range queries, etc.)
a newer implementation called Storage-Attached Indexes (SAI); they are currently marked as beta, but they provide more flexibility than standard secondary indexes, with less overhead than DSE Search
I have a list of Strings "A", "B", "C".
I would like to know how I can check whether all these Strings exist in a Cassandra column.
I have two approaches I previously used for relational databases, but I recently moved to Cassandra and I don't know how to achieve this there.
The problem is that I have about 100 strings to check, and I don't want to send 100 requests to my database. That wouldn't be wise.
Interesting question... I don't know the schema you're using, but if your strings are in the only PK column (or in a composite PK where the other columns' values are known at query time), then you could probably issue the 100 queries without worries. The key cache will help you avoid hitting the disks, so you could get fast responses.
Instead, if you intend to do this for a column that is not part of any PK, you'll have a hard time figuring it out unless you perform some kind of trick, and all of these are subject to performance restrictions and/or increased code complexity anyway.
As an example, you could build a "frequency" table for the purpose described above, where you store how many times you have "seen" each string "A", "B", etc., and query this table when you need to retrieve the information:
SELECT frequencies FROM freq_table WHERE pk IN ('A', 'B', 'C');
Then you still need to loop over the result set and check that each value is > 0. An alternative could be to issue a SELECT COUNT(*) before the real query, because you know in advance how many records you should get (e.g. 3 in my example); getting the correct number of records back could be enough, barring edge cases (e.g. a counter that is zero).
Of course you'd need to maintain this table on every insert/update/delete of your main table, raising the complexity of the solution, and of course all the usual warnings about IN clauses and COUNT apply...
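To make the idea concrete, a minimal sketch of such a frequency table using the counter type (all names are illustrative):

CREATE TABLE freq_table (
    pk text PRIMARY KEY,   -- the string itself, e.g. 'A'
    frequencies counter    -- how many times the string has been seen
);

-- maintained by the application on every write to the main table:
UPDATE freq_table SET frequencies = frequencies + 1 WHERE pk = 'A';

Remember that counter columns can only be incremented or decremented, never set, and a counter table cannot hold regular columns besides its primary key.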
I would probably stick with the 100 queries: against a well-designed table they should not be a problem, unless your cluster is inadequate for the problem size you're dealing with.
CQL gives you the possibility of using an IN clause, like:
SELECT first_name, last_name FROM emp WHERE empID IN (105, 107, 104);
But this approach might not be the best, since it can trigger selects across all nodes of the cluster.
So it depends very much on how your data is structured.
From this perspective, it might be better to run 100 separate queries.
We are running Apache Cassandra 2.1.x and using the DataStax driver. I have a use case where we need to keep a count of various things. I came up with a schema something like this:
create table count (
    partitionKey bigint,
    type text,
    uniqueId uuid,
    PRIMARY KEY (partitionKey, type, uniqueId)
);
So this is nothing but wide rows. My question is: if I do something like
select count(uniqueId) from count where partitionKey=987 and type='someType';
and it comes back with a count of, say, 150k, will that be an expensive operation for Cassandra? Is there a better way to compute a count like this? I'd also like to know if anyone has solved something like this before.
I would prefer to stay away from the counter type, as it's not that accurate, and keeping a count at the application level is doomed to fail anyway.
It would also be great to know how Cassandra computes such a count internally.
A big thanks to folks who help the community!
Even if you specify the partition key, Cassandra still needs to read 150k cells to give you the count.
If you don't specify the partition key, Cassandra needs to scan all rows on every node to give you the count.
The best approach is to use a counter table.
CREATE TABLE id_count (
partitionkey bigint,
type text,
count counter,
PRIMARY KEY ((partitionkey, type))
);
Whenever you insert a uniqueId, increment the count here.
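For illustration, with the values from the question, the increment and the lookup would look like this (a sketch, not tested):

UPDATE id_count SET count = count + 1 WHERE partitionkey = 987 AND type = 'someType';
SELECT count FROM id_count WHERE partitionkey = 987 AND type = 'someType';

Reading the count then becomes a single-cell read instead of a 150k-cell scan, at the cost of the accuracy caveats of the counter type that you mentioned.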
Imagine a table with thousands of columns, where most of the data in each row is null. One of the columns is an ID, and this ID is known upfront.
select id,SomeRandomColumn
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
SomeRandomColumn is one of thousands and, in most cases, the only column with data. SomeRandomColumn is NOT known upfront as the column that contains data.
Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong, unless there is a CQL query that can use the hint in a single query, something like this:
select ColumnHint.{DataColumnName}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
In MongoDB I would just have a collection, and the document I got back would have a "Type" attribute describing the data. So perhaps my real question is: how do I replicate what I can do with MongoDB in Cassandra? My Cassandra journey so far has been to create a UDT for each unique document, followed by altering the table to add this new UDT as a column. My starter table looks like this, where ColumnDataName is the hint:
CREATE TABLE IF NOT EXISTS WideProductInstance (
    Id uuid,
    ColumnDataName text,
    PRIMARY KEY (Id)
);
Thanks
Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
No, you cannot do that, and it's pretty easy to explain why. To know that a column contains data, Cassandra needs to read it. And if it has to read the data, since the effort has already been spent on disk, it will just return that data to the client.
The only saving you would get, if Cassandra were capable of filtering out null columns, is network bandwidth...
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong, unless there is a CQL query that can use the hint in a single query
Your idea amounts to storing, in another table, a list of all the columns that actually contain real data and not null. That sounds like a JOIN, which is bad and not supported. And if you need to read this reference table before reading the original table, you'll have to read from many places, and that is going to be expensive.
So perhaps my real question is how do I replicate what I can do with MongoDB in Cassandra?
Don't try to replicate the same feature from MongoDB in Cassandra. The two databases have fundamentally different architectures. What you have to do is reason about your functional use case, "How do I want to fetch my data from Cassandra?", and design a proper data model from there. A Cassandra data model is designed by query.
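For instance, one common query-first model for sparse data like yours stores one attribute per clustering row, so only the attributes that actually have data exist at all. A sketch, with table and column names of my own (an assumption about your use case):

CREATE TABLE IF NOT EXISTS product_instance_attrs (
    id uuid,
    attr_name text,    -- plays the role of your "hint"
    attr_value text,
    PRIMARY KEY (id, attr_name)
);

-- fetch every populated attribute for one id in a single query:
SELECT attr_name, attr_value FROM product_instance_attrs
WHERE id = 92e72b9e-7507-4c83-9207-c357df57b318;

This gives you the "only columns with data" behavior without thousands of mostly-null columns, at the cost of storing values as text (or as a UDT/blob if you need typed values).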
The best advice for you is to watch some Cassandra data modeling videos (they're free) at http://academy.datastax.com
I have a Cassandra table that is created like:
CREATE TABLE table(
num int,
part_key int,
val1 int,
val2 float,
val3 text,
...,
PRIMARY KEY((part_key), num)
);
part_key is 1 for every record, because I want to execute range queries and I've only got one server (I know that's not a good use case). num is the record number, from 1 to 1,000,000. I can already run queries like
SELECT num, val43 FROM table WHERE part_key=1 and num<5000;
Is it possible to do some more filtering in Cassandra, like:
... AND val45>463;
I think it's not possible like that, but can somebody explain why?
Right now I do this filtering in my code, but are there other possibilities?
I hope I did not miss a post that already explains this.
Thank you for your help!
Cassandra range queries are only possible on the last clustering column specified by the query. So, if your PK is (a, b, c, d), you can do
... where a=2 and b=4 and c>5
... where a=2 and b>4
but not
... where a=2 and c>5
This is because data is stored in partitions, indexed by the partition key (the first key of the PK), and then sorted by each successive clustering key.
If you have exact values, you can add a secondary index on val4 and then do
... and val4=34
but that's about it. And even then, you want to hit a partition before applying the index; otherwise you'll get a cluster-wide query that will likely time out.
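A sketch of what that looks like, reusing the question's placeholder table name and assuming a val4 int column exists (the index name is mine):

CREATE INDEX val4_idx ON table (val4);

-- restricted to a single partition before the index is applied:
SELECT num, val4 FROM table WHERE part_key = 1 AND val4 = 34;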
These querying limitations exist because of the way Cassandra stores data for fast insert and retrieval. All the data in a partition is held together, so filtering inside a partition on the client side is usually not a problem, unless you have very large wide rows (in which case the schema should perhaps be reviewed).
Hope that helps.