Primary key and indexing in cassandra

I'm new to Cassandra and still learning.
create table url (
id_website int,
url varchar,
data varchar,
primary key(url, id_website)
);
Hi, I have a table of URLs for websites.
I don't want all the URLs to be on the same node, which is why url comes first in the primary key, so that it is the partition key.
Most of the time I'm going to retrieve the data for a specific URL, e.g. "url = ? and id_website = ?".
However, what about performance when I want to retrieve some or all of the URLs of a website:
select * from url where id_website = 1 limit XX allow filtering;
I think this query is going to be dispatched to all the nodes, each of which scans its table for id_website = 1 until the limit is reached; the results are then merged and sent back to my client.
But is this scan going to use an index and be efficient, or will it read the values of the id_website column one by one and compare them (and so be inefficient)? I made id_website part of the primary key, so I expect it to be indexed, but I really don't know.
Is there a tool in Cassandra like MySQL's EXPLAIN to check whether a query is using an index?
Thanks.
--
EDIT
"Create a second table with id_website as partition key (and write/delete in batch)"
I don't want to use this solution because I may have one or two websites that are really huge and have millions of URLs (and millions of other websites with only a few URLs).
If I partition on id_website and these two or three websites end up on the same node, it may cause storage problems, or the node handling these websites may be overloaded while the others get nothing. I want to spread the data over all the nodes. That's why I insisted on partitioning on the url.
"You create a secondary index on id_website (which creates a table for you)"
What about this solution? If I understand correctly, each node would have a table indexing the rows it stores based on id_website (so not the rows of other nodes). So I can spread my URLs across many nodes, and I won't have one node handling a big index containing all the URLs of a specific website.
Now when I use my query
select * from url where id_website = 1 limit XX allow filtering;
each node receives the query, but this time it doesn't have to loop through the partitions (url column): it can directly look up in the index the URLs belonging to id_website and return the rows (or nothing). Right?
The downside of this solution is that every time the request is made, it hits every node. However, it should be fast thanks to the new index?

You're on the right track. Using ALLOW FILTERING, you're just asking Cassandra to scan all nodes, which is very inefficient. id_website is indexed within each partition, but since you are not telling Cassandra where to go, it must hit all partitions (all nodes), even those that don't contain information for the selected id_website. Once Cassandra hits a partition, it knows how to look for this information and does not need to scan the whole partition to get data back.
To solve this problem in Cassandra you have to denormalize, and in this situation you can do it in two possible ways:
Create a second table with id_website as partition key (and write/delete in batch)
You create a secondary index on id_website (which creates a table for you)
**EDIT DUE TO QUESTION EDIT**
What you said is right: secondary indexes are handled as "local indexes", meaning each node creates a local index table only for the data it owns. The following is a good read about secondary indexes (which you have already understood).
Once you have created the index, you have to remove ALLOW FILTERING from the query.
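To make the "local index" idea concrete, here is a toy Python model (not Cassandra code). The Node class, the md5-based placement, and the sample rows are all made up for this sketch. It only illustrates the point above: the query still visits every node, but each node answers from its own index instead of scanning its rows.

```python
import hashlib
from collections import defaultdict

class Node:
    """Toy node: stores the rows it owns plus a local index on id_website."""
    def __init__(self):
        self.rows = {}                       # (url, id_website) -> data
        self.local_index = defaultdict(set)  # id_website -> keys stored on THIS node

    def insert(self, url, id_website, data):
        self.rows[(url, id_website)] = data
        self.local_index[id_website].add((url, id_website))

    def query_by_website(self, id_website):
        # Index lookup: no scan over self.rows.
        keys = self.local_index.get(id_website, ())
        return [(key[0], self.rows[key]) for key in keys]

def owner(url, nodes):
    # Assumption: md5 stands in for the partitioner (Cassandra's default is Murmur3).
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def select_by_website(nodes, id_website, limit):
    # Scatter-gather: every node is asked, even if it holds nothing for this website.
    results = []
    for node in nodes:
        results.extend(node.query_by_website(id_website))
    return sorted(results)[:limit]

nodes = [Node() for _ in range(4)]
for i in range(20):
    url = f"http://site1.example/page{i}"
    owner(url, nodes).insert(url, 1, f"data{i}")
owner("http://site2.example/", nodes).insert("http://site2.example/", 2, "other")

print(len(select_by_website(nodes, 1, limit=100)))  # 20
```

The cost that remains is the fan-out in select_by_website: it grows with the number of nodes, which is the trade-off the question describes.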
HTH,
Carlo

Related

How do I design a table in Cassandra for a TinyURL use case?

Recently I came across a well-known design problem.
'Tiny URL'
What I found was people vouching for NoSQL databases such as DynamoDB or Cassandra. I've been reading about Cassandra for a couple of days, and I want to design my solution around this database for this specific problem.
What would be the table definition? If I choose the following table definition:
Create table UrlMap(tiny_url text PRIMARY KEY, url text);
Wouldn't this result in a lot of partitions, since my partition key can take on around 68 billion values (using 6-character base64 strings)?
Would that somehow affect the overall read/write performance? If so, what would be a better model for the table?

Lots of partitions is fine; think of it as using C* (Cassandra) as a key-value store.

The primary principle of data modelling in Cassandra is to design one table for each application query.
For a URL shortening service, the main application query is to retrieve the equivalent full URL for a given tiny URI. In pseudo-code, the query looks like:
GET long url FROM datastore WHERE uri = ?
Note that for the purpose of a service, we won't store the web domain name, to make the app reusable for any domain. The filter (WHERE clause) is the URI, so this is what you want as the partition key, and we design the table accordingly:
CREATE TABLE urls_by_uri (
uri text,
long_url text,
PRIMARY KEY(uri)
);
If we want to retrieve the URL for http://tinyu.rl/abc123, the CQL query is:
SELECT long_url FROM urls_by_uri WHERE uri = 'abc123';
As Phact and Andrew pointed out, there is no need to worry about the number of partitions (records) you'll be storing in the table, because you can store as many as 2^128 partitions in a Cassandra table, which for practical purposes is limitless.
In Cassandra, each partition key gets hashed into a token value using the Murmur3 hash algorithm (default partitioner). This implementation distributes each partition randomly across all nodes in the cluster. The same hash algorithm is used to determine which node "owns" the partition, making retrieval (reads) very fast in Cassandra.
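The hashing step can be sketched with a toy token ring in Python. This is only an illustration under stated assumptions: md5 truncated to 64 bits stands in for Murmur3 (which is not in the Python standard library), and the Ring class, the four node names, and the evenly spaced tokens are invented for the example.

```python
import bisect
import hashlib

def token(partition_key: str) -> int:
    # Assumption: md5 truncated to a signed 64-bit value stands in for Murmur3.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

class Ring:
    """Toy token ring: each node owns the token range ending at its own token."""
    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}
        self.entries = sorted((t, name) for name, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        i = bisect.bisect_left(self.tokens, token(partition_key))
        return self.entries[i % len(self.entries)][1]

# Four hypothetical nodes with evenly spaced tokens over the signed 64-bit range.
step = 2**64 // 4
ring = Ring({f"node{i}": -2**63 + (i + 1) * step - 1 for i in range(4)})

counts = {}
for n in range(10_000):
    node = ring.owner(f"uri{n}")
    counts[node] = counts.get(node, 0) + 1
print(counts)  # roughly a quarter of the keys on each node
```

Because the owner is computed from the key alone, a read for a single uri can be routed straight to the node that holds it, however many partitions the table contains.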
As long as you limit the SELECT queries to a single partition, retrieving the data is extremely fast. In fact, I work with hundreds of companies who have an SLA on reads of 95% between 6-9 milliseconds. This is achievable in Cassandra when you model your data correctly and size your cluster correctly. Cheers!

Regarding Cassandra's (sloppy, still confusing) documentation on keys, partitions

I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is (clientId: int, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find by PK, it also fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary index, and the primary key and clustering clauses in a CREATE TABLE command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC);
insert into record (id,version,clientid,payload) values (d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values (d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point. If one was looking for a single row given the clientId, and UUID ---AND--- Cassandra allowed you to skip specifying the clientId so it wouldn't know which node(s) to search, then sure that find could be slow. But it doesn't:
select * from record where id=d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: ... despite the performance unpredictability,
use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude that Cassandra handles searches on high-cardinality tables that return "very few results" just fine?

Anything that requires reading the entire contents of the database won't work, which is the case when scanning on id, since any of your clientid partitions may contain a matching row. Walking through potentially thousands of SSTables per host, and through each partition in each of those, to check will not work. If you are having a hard time with the data model and don't fully grasp the difference between partition keys and clustering keys, I would recommend that you work through some introductory classes (e.g. DataStax Academy), YouTube videos, or a book before designing your schema. This is not a relational database, and designing around your data instead of your queries will get you into trouble. When moving from Oracle, you should not just copy your tables over and move the data, or it will not work well.
The clustering key determines the order in which the data for a partition is stored on disk, which is what it is referring to as the "built-in index". Each SSTable has an index component that contains the partition key locations for that SSTable. It also includes an index of the clustering keys for each partition, with one entry every 64 KB (by default at least), that can be searched. The clustering keys that fall between these indexed points are unknown, so they all have to be checked. A long time ago a bloom filter of clustering keys was kept as well, but it helped so rarely relative to its overhead that it was removed in 2.0.
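The "index entry every 64 KB, then check everything in between" behaviour can be pictured with a small Python sketch. It is a toy model, not the SSTable format: BLOCK counts rows instead of kilobytes, and build_sparse_index and lookup are names invented for the illustration.

```python
import bisect

BLOCK = 4  # toy stand-in for "one index entry every 64 KB"

def build_sparse_index(sorted_keys):
    # Keep only the first clustering key of every block, with its offset.
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), BLOCK)]

def lookup(sorted_keys, sparse_index, wanted):
    """Return (position or None, number of keys compared inside the block)."""
    firsts = [k for k, _ in sparse_index]
    # The last indexed key <= wanted identifies the only block that could hold it.
    b = bisect.bisect_right(firsts, wanted) - 1
    if b < 0:
        return None, 0
    start = sparse_index[b][1]
    checked = 0
    for pos in range(start, min(start + BLOCK, len(sorted_keys))):
        checked += 1
        if sorted_keys[pos] == wanted:
            return pos, checked
        if sorted_keys[pos] > wanted:
            break  # keys are sorted, so the wanted key cannot appear later
    return None, checked

keys = list(range(0, 100, 3))      # 0, 3, 6, ..., 99 (already sorted)
index = build_sparse_index(keys)
print(lookup(keys, index, 30))     # (10, 3): found after checking 3 keys in one block
print(lookup(keys, index, 31))     # (None, 4): absent, but only one block was read
```

This only works because the partition is already known. Without the partition key there is no single block to jump to, which is why a search on id alone has to visit every partition.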
Secondary indexes are difficult to scale well, which is where the warning about cardinality comes from. I would strongly recommend just denormalizing the data and not using an index in any form, as large scatter-gather queries across a distributed system are going to have availability and performance issues. If you really need one, check out http://www.doanduyhai.com/blog/?p=13191 to try to get it right (not worth it in my opinion).

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters that callers can specify requests the most recent row; in that case I append "LIMIT 1" to the end of the CQL statement, since the table is ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device IDs to get back the latest entries for. So my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);

IN is not recommended when there are a lot of parameters for it: under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not recommended. If you specify a limit, it applies to the whole statement, so you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in the IN list becomes one query) and put LIMIT 1 on each of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. With IN, the coordinator goes to multiple nodes under the hood anyway, but it also has to do more work to gather all the results for you, and it might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
This is all in case you can't afford more disk space for your cluster.
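The "one query per device" approach could look roughly like this on the client side. This is a sketch under assumptions: fetch_latest is a hypothetical stand-in for running the single-partition "... LIMIT 1" query through whatever driver you use, and FAKE_TABLE replaces the real table so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

# Fake data standing in for the DATA table: device_id -> rows sorted by
# activity_timestamp DESC (the clustering order in the question).
FAKE_TABLE = {
    "dev-a": [(300, "a3"), (200, "a2"), (100, "a1")],
    "dev-b": [(250, "b2"), (150, "b1")],
    "dev-c": [],
}

def fetch_latest(device_id):
    """Hypothetical stand-in for the single-partition query
    '... AND DEVICE_ID = ? AND DATA_SCHEMA = ? LIMIT 1'."""
    rows = FAKE_TABLE.get(device_id, [])
    return rows[0] if rows else None

def latest_for_devices(device_ids, max_workers=8):
    # One small query per device, issued in parallel from the client, instead
    # of one IN query that a single coordinator has to fan out and merge.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_latest, device_ids)
        return {d: r for d, r in zip(device_ids, results) if r is not None}

print(latest_for_devices(["dev-a", "dev-b", "dev-c"]))
# {'dev-a': (300, 'a3'), 'dev-b': (250, 'b2')}
```

Devices with no rows are simply left out of the result, and max_workers caps how many requests are in flight at once.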
Usual Cassandra solution
Data in Cassandra should be modeled to be ready for the query (query first). So you would have one additional table with the same partition key as you have now, but without the clustering column activity_timestamp, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double (()) is intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN; this table contains only the latest entry, so you don't have to use LIMIT 1 because there is only one entry per partition key. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry: they are inexpensive and CPU bound. With Cassandra it's always "bring on the writes", I guess :)
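The write-twice pattern can be sketched with two in-memory dictionaries standing in for the two tables. This is an illustration only: the names history, write, and latest_for are invented here, and it assumes writes arrive in timestamp order so that the last write is also the latest entry.

```python
from collections import defaultdict

history = defaultdict(list)   # partition key -> [(activity_timestamp, payload), ...]
latest_entry = {}             # partition key -> (activity_timestamp, payload)

def write(key, activity_timestamp, payload):
    # Write 1: the full history table (clustered by activity_timestamp).
    history[key].append((activity_timestamp, payload))
    # Write 2: the latest_entry table. Writing to an existing primary key
    # overwrites the row, so this table holds one row per partition key.
    latest_entry[key] = (activity_timestamp, payload)

def latest_for(keys):
    # Equivalent of the IN query against latest_entry: no LIMIT needed.
    return {k: latest_entry[k] for k in keys if k in latest_entry}

# Partition key: (application_id, partner_id, location_id, device_id, data_schema)
k1 = ("app", "partner", "loc", "dev-1", "schema")
k2 = ("app", "partner", "loc", "dev-2", "schema")
write(k1, 100, "old")
write(k1, 200, "new")
write(k2, 150, "only")

print(latest_for([k1, k2]))  # one row per device: (200, 'new') and (150, 'only')
```

The history table keeps every row, while latest_entry never grows beyond one row per device, which is the extra space cost mentioned in the answer.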
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost

Your table definition is not suitable for such use of the IN clause. Indeed, it is supported only on the last field of the primary key or the last field of the clustering key. So you can:
swap the last two fields of the primary key
use one query for each device id

Is a read with one secondary index faster than a read with multiple in cassandra?

I have this structure where I want a user to see the other users' feeds.
One way of doing it is to fan out an action to all interested parties' feeds.
That would result in a query like select from feeds where userid=
Otherwise I could avoid writing so much data and, since I am already doing a read, I could do:
select from feeds where userid IN (list of friends).
Is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big, writing code to test a single node is not worth it, so I am asking for your knowledge.

If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN
in the WHERE clause. Under most conditions, using IN in the WHERE
clause is not recommended. Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId (
userid UUID,
feedid UUID,
action text,
PRIMARY KEY (userid, feedid));
With a composite primary key that uses userid as the partition key, you will then be able to run the SELECT/WHERE/IN query mentioned above and get the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. If that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example also assumes that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.

Cassandra or Hbase?

I have a requirement, where I want to store the following:
Mac Address // PKEY
TimeStamp // PKEY
LocationID
ownerName
Signal Strength
The insertion logic is as follows:
Store the above statistics for each active device (MacAddress) once every hour at each location (LocationID)
The entries are created at end of each hour, so the primary key will always be MAC+TimeStamp
There are no updates, only insertions
The queries which can be performed are as follows:
Give me all the entries for last 'N' hours Where MacAddress = "...."
Give me all the entries for last 'N' hours Where LocationID IN (locID1, locID2, ..);
Needless to say, there are billions of entries, and I want to use either HBase or Cassandra. I've tried to explore, and it seems that Cassandra may not be the correct choice.
The reason is that if I have the following in Cassandra:
< < RowKey > MacAddress:TimeStamp > >
+ LocationID
+ OwnerName
+ Signal Strength
Both the queries will scan the whole database, right? Even if I add an index on LocationID, that is only going to help the second query to some extent, because there is no index on the timestamp. (I believe that searching on timestamp is not fast, as the MacAddress:TimeStamp composite key would not allow us to search on timestamp alone, so a full scan would happen instead. Is that correct?)
I'm stuck here big time, and any insights on whether we should opt for HBase or Cassandra would really help.

The right way to model this with Cassandra is to use a table partitioned by mac address, ordered by timestamp, and indexed on location id. See the Cassandra data model documentation, especially the section on clustering [predefined sorting]. None of your queries will require a full table scan.

You have to remember that NoSQL databases like Cassandra allow horizontal scaling and make it a lot easier to shard the data. By developing a sharding strategy (identifying the shard key, etc.) you could dramatically reduce the size of the data on a single instance and make queries feasible, even against massive data sets.

Either one would work for this query:
Give me all the entries for last 'N' hours Where MacAddress = "...."
In Cassandra you would want to use an ordered partitioner so you can do easy scans. That way you would not have to scan the entire table. (I'm a little rusty on Cassandra.)
In HBase the data is always ordered by the rowkey, so the scan becomes easy. You would just set a start and stop rowkey. Conceptually it would be:
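// Conceptual: HBase row keys are byte arrays, so the strings are converted first.
scan.setStartRow(Bytes.toBytes(mac + ":" + timestamp));
scan.setStopRow(Bytes.toBytes(mac + ":" + endtimestamp));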
And then it would only scan over the rows for the given mac address for the given time period--only a small subset of the data.
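Why a start/stop row scan touches only a small slice can be shown with a sorted list in Python. This is a toy model rather than HBase client code: make_key and scan are invented helpers, the MAC addresses are made up, and the timestamp is zero-padded (an assumption about the key format) so that string order matches time order.

```python
import bisect

def make_key(mac, ts):
    # Zero-padded timestamp so that lexicographic order matches numeric order.
    return f"{mac}:{ts:010d}"

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

rows = sorted(
    make_key(mac, ts)
    for mac in ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02")
    for ts in range(0, 36000, 3600)   # ten hourly entries per device
)

hits = scan(rows,
            make_key("aa:bb:cc:00:00:01", 7200),
            make_key("aa:bb:cc:00:00:01", 18000))
print(hits)  # three keys for device ...:01, with timestamps 7200, 10800 and 14400
```

Two binary searches locate the slice, so the work depends on the size of the requested window and not on the size of the table.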
This query is much harder:
Give me all the entries for last 'N' hours Where LocationID IN
(locID1, locID2, ..);
Cassandra does have secondary indexes, so it seems like it would be "easy", but I don't know how much data it would scan through. I haven't looked at Cassandra since it added secondary indexes.
In HBase you'd have to scan the entire table or create a second table. I would recommend creating a second table where the rowkey would be < location:timestamp > and duplicating the data. Then you'd use that table to look up the data by location, using a scan and setting the start and end keys.
