Storing list in cassandra - cassandra

I want to save a friends list in Cassandra where a user may have few hundred of friends . Should i store the list of friends, which is an email id, as a list or set in Cassandra or should i create a separate table having the columns user_id and friends which will include all the user(millions of users) along with their friends .
If i create a separate table with user_id and friends column will there be degradation in performance while retrieving the entire friend list of the user/ one friend of the user as the table will contain many records/rows.

It is important to note that lists and sets in Cassandra are not iterable. This means when you query for them, you get back the whole list or the whole set. If the collection has a high cardinality then this could pose issues in querying such as read timeouts or even a heap OOM error.
Since it sounds like there is no cap on the amount of friends one can have, one option could be to have a separate table that is partitioned on user and clustered on friend.
CREATE TABLE user_friends (
owner_user_id int,
friend_user_id int,
PRIMARY KEY(owner_user_id, friend_user_id)
);
This will ensure that the friend_user_id is in order and will allow you to do client side paging if the number of friends is very large. It also allows for a quick way to check if a person is a friend of a user.

Related

Are client side joins permissable in Cassandra if client drills down on datapoint?

I have this structure with about 1000 data points in a list on the website:
Datapoint1:
Datapoint2:
...
Datapoint1000:
With each datapoint containing 6 fields of information.
Each datapoint can be opened to reveal an additional 2-3x of information in sublist.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra? Should I just go ahead and get it all in one go?
Should I just go ahead and get it all in one go?
Definitely not.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra?
That's absolutely the way you should do it. Cassandra is great at writing large amounts of data, but not so great a returning large amounts of data. More, small key-based queries are definitely the way to go.
It is possible to do the JOINs on the client side but as a general proposition, queries which require joins indicate that you possibly didn't design the data model correctly.
You need to model your data such that (a) each application query (b) maps to a single table. If you need to do a client-side JOIN then you need to query the database multiple times to get the data required by your app. It will work but it's not efficient so affects the performance of the app and the database.
To illustrate with an example, let's say you app needs to display a customer's list of orders. The table design would need to be partitioned by customer with (clustered) multiple rows of orders:
CREATE TABLE orders_by_customerid (
customerid text,
orderid text,
orderdate timestamp,
ordertotal decimal,
...
PRIMARY KEY (customerid, orderid)
)
You would retrieve the list of orders for a customer with:
SELECT ... FROM orders_by_customerid WHERE customerid = ?
By default, the driver or Stargate API your app is using would page the results so only the first 100 rows (for example) will be returned instead of retrieving thousands of rows in a single pass. Note that the page size is configurable. Cheers!

How Can I get Item by Attribute in DynamoDB in nodejs

I have 7 columns,
id, firstname, lastname, email, phone, createAt, updatedAt
I am trying to write an api in Nodejs to get items by phone.
id is the primary key.
I am trying to get the data by phone or email. I didn't created sortkey or GSI yet.
I ended up getting suggestions to use scan with filters in dynamodb and get all records .
Is there any other way to achieve it?
Your question already contains two good answers:
The slow way is to use a Scan with a FilterExpression to find the matching items. This will take the time (and also cost!) of reading the entire database on every query. It only makes sense if these queries are very infrequent.
If these queries by phone are not super-rare, it is better to be prepared in advance: add a GSI with the phone as its partition key, to allow looking up items by phone value using a Query with IndexName and KeyConditionExpression. These queries will be fast and cheap: you will only pay for items actually retrieved. The downside of this approach is the increased write costs: The cost
of every write doubles (DynamoDB writes to both the base table and
the index), and the cost of storage increases as well. But unless
your workload is write-mostly (items are very frequently updated and
very rarely read), option 2, using a GSI, is still better than
option 1 - a full-table Scan.
Finally another option you have is to reconsider your data model. For example, if you always look up items by phone and never by id, you can make phone the partition key of your data and id the sort key (to allow multiple items with the same phone). But I don't know if this is relevant for your use case. If you need to look up items sometimes by id and sometimes by phone, probably GSI is exactly what you need.

Cassandra changing Primary Key vs Firing multiple select queries

I have a table that stores list products that a user has. The table looks like this.
create table my_keyspace.userproducts{
userid,
username,
productid,
productname,
producttype,
Primary Key(userid)
}
All users belong to a group, there could be min 1 to max 100 users in a group
userid|groupid|groupname|
1 |g1 | grp1
2 |g2 | grp2
3 |g3 | grp3
We have new requirement to display all products for all users in a single group.
So do i change my userproducts so that my Partition Key is now groupid and make userid as my cluster key, so that i get all my results in one single query.
Or do I keep my table design as it is and fire multiple select queries by selecting all users in a group from second table and then fire one select query for each user, consolidate data in my code and then return it to the users
Thanks.
Even before getting to your question, your data modelling as you presented it has a problem: You say that you want to store "a list products that a user has". But this is not what the table you presented has - your table has a single product for each userid. The "userid" is the key of your table, and each entry in the table, i.e, each unique userid, has one combination of the other fields.
If you really want each user to have a list of products, you need the primary key to be (userid, productid). This means that each record is indexed by both a userid and a productid, or in other words - a userid has a list of records each with its own productid. Cassandra allows you to efficiently fetch all the productid records for a single userid because it implements the first part of the key as a "partition key" but the second part is a "clustering key".
Regarding your actual question, you indeed have two options: Either do multiple queries on your original tables, or do so-called denormalization, i.e., create a second table with exactly what you want searchable immediately. For the second option you can either do it manually (update both tables every time you have new data), or let Cassandra update the second table for you automatically, using a feature called Materialized Views.
Which of the two options - multiple queries or multiple updates - to use really depends on your workload. If it has many updates and rare queries, it is better to leave updates quick and make queries slower. If, on the other hand, it has few updates but many queries, it is better to make updates slower (when each update needs to update both tables) but make queries faster. Another important issue is how much query latency is important for you - the multiple queries option not only increases the load on the cluster (which you can solve by throwing more hardware at the problem) but also increases the latency - a problem which does not go away with more hardware and for some use cases may become a problem.
You can also achieve a similar goal in Cassandra by using the Secondary Index feature, which has its own performance characteristics (in some respects it is similar to the "multiple queries" solution).

Defining a partition key in Cassandra

I'm playing around with Cassandra for the first time and I feel like I understand the basics and limits. I'm working with the following model, as an example, for storing tweets collected by hashtag.
create table posts
(
id text,
status text,
service text,
hashtag text,
username text,
caption text,
image text,
link text,
repost boolean,
created timestamp,
primary key (hashtag, created)
);
This works very well for the type of query I need:
select * from posts where hashtag = 'demo' order by created desc;
However, if I understand things correctly, there is an upper limit to the number of posts I could store using the singular 'demo' partition key and more importantly, the entire set of posts matching the 'demo' partition key would have to be stored with each replica. I'd should probably use a more random or variable partition key (maybe the id of the post) if I understand correctly, but I don't know what to use that won't alter the requirements for the query.
If I use id as the partition key (e.g. PRIMARY KEY (id, created)) and add a secondary index on the hashtag column, I get the following error when I run my query:
ORDER BY with 2ndary indexes is not supported.
I get that to use ORDER BY, the partition key must be featured in the where clause, hence my original thought to use hashtag.
Am I overthinking things or is there a better candidate for the partition key?
The direction you go would depend on what volume of writes you expect and how big your cluster is.
If you have a small user community and a small cluster, then you might be overthinking things. A partition can theoretically hold up to 2 billion rows. That's a big number, and would anyone actually want to view more than a few thousand of the most recent tweets for a hashtag? So you'd probably have some kind of cleanup mechanism such as using TTL to delete tweets after some amount of time, which will free up space in the partition, keeping you well below the 2 billion row limit.
If you don't want to cleanup up old tweets, but want to preserve them for many years, then you might want to use a compound partition key like this:
primary key ((hashtag, year), created)
This would partition the tweets by the tag and the year, so you could store up to 2 billion tweets per tag per year.
The nice thing about partitioning by hashtag is that Cassandra can keep the tweets for a tag sorted by the creation timestamp, making it easy to retrieve the most recent ones with a single query as you've shown.
But if your user community is big, then the issue that is of a bigger concern is avoiding hot spots. If you use just hashtag and a time bin like year for a partition key, then all reads and writes will be to the small number of replicas for that hashtag. If a hashtag is very active on a given day, then you've got all your reads and writes going to just a node or two depending on what replication factor you are using.
If you want to spread out the read and write load, you need to increase the cardinality of a hashtag so that it will map to multiple nodes. Using id as the partition key would achieve this, but that would be going too far since then every tweet would be in a separate partition and you'd get no sorting or easy way to retrieve the most recent tweets for a hashtag.
So a better approach is to create separate bins or buckets, like this:
primary key ((hashtag, bin), created)
The number of bins you create depends on your write load. Let's say you decide that ten nodes can handle the write load for a hot hashtag, then bin would be a value from 0 to 9.
There are a number of ways to set the bin number. You could do a modulo of id by 10, or pick a random number between 0 and 9, or generate a hash value from some combination of fields and take modulo 10 of the results. Whatever method you choose, make sure the numbers from 0 to 9 are equally likely so that your data is spread equally across the bin partitions.
With multiple bins, it is not as easy to retrieve the x most recent tweets for a hashtag since you need to query all the bins and merge the results. You can asynchronously issue a query for each bin of a hashtag in parallel and then merge the results on the client side. Or you can do a single query using the IN clause like this:
select * from posts where hashtag = 'demo' and bin IN (0,1,2,3,4,5,6,7,8,9) AND created > ...
But Cassandra won't sort the results of the single query, so you'd have to do a sort on the client side, which is slower than doing a merge of separate ordered queries.
Now in many cases there will be hashtags that have very little volume, so you might not want to bother using ten bins for them unless they get hot. If so you can make it dynamic in your application, typically using just bin 0, but then increasing the number of bins when a tag is found to be popular. You could use a static column in bin 0 to keep track of the number of active bins for a hashtag.
You should avoid using secondary indexes. They are very inefficient in Cassandra.

Cassandra/Redis: Way to create feed without Cassandra 'IN' secondary index?

I'm having a bit of an issue with my application functionality integrating with Cassandra. I'm trying to create a content feed for my users. Users can create posts which, in turn, have the field user_id. I'm using Redis for the entire social graph and using Cassandra columns solely for objects. In Redis, user 1 has a set named user:1:followers with all of his/her follower ids. These follower ids correspond with the Cassandra ids in the users table and user_ids in the posts table.
My goal was originally to simply plug all of the user_ids from this Redis set into a query that would use FROM posts WHERE user_id IN (user_ids here) and grab all of the posts from the secondary index user_id. The issue is that Cassandra purposely does not support the IN operator in secondary indexes because that index would force Cassandra to search ALL of its nodes for that value. I'm left with only two options I can see: Either create a Redis list of user:1:follow_feed for the post IDs then search Cassandra's primary index for those posts in a single query, or keep it the way I have it now and run an individual query for every user_id in the user:1:follower set.
I'm really leaning against the first option because I already have tons and tons of graph data in Redis, and this option would add a new list for every user. The second way is far worse. I would put a massive read load on Cassandra and it would take a long time to run individual queries for a set of ids. I'm kind of stuck between a rock and a hard place, as far as I see it. Is there any way to query the secondary indexes with multiple values? If not, is there a more efficient way to load these content feeds (RAM and speed wise) compared to the options of more Redis lists or multiple Cassandra queries? Thanks in advance.
Without knowing the schema of the posts table (and preferably the others, as well), it's really hard to make any useful suggestions.
It's unclear to me why you need to have user_id be a secondary index, as opposed to your primary key.
In general it's quite useful to key content like posts off of the user that created it, since it allows you to do things like retrieve all posts (optionally over a given range, assuming they are chronologically sorted) very efficiently.
With Cassandra, if you find that a table can effectively answer some of the queries that you want to perform but not others, you are usually best of denormalizing that table and creating another table with a different structure in order to keep your queries to a single CQL partition and node.
CREATE TABLE posts (
user_id int,
post_id int,
post_text text,
PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC)
This table can answer queries such as:
select * from posts where user_id = 1234;
select * from posts where user_id = 1 and post_id = 53;
select * from posts where user_id = 1 and post_id > 5321 and post_id < 5400;
The reverse clustering on post_id is to make retrieving the most recent posts the most efficient by placing them at the beginning of the partition physically within the sstable.
In that example, user_id being a partition column, means "all cql rows with this user_id will be hashed to the same partition, and hence the same physical nodes, and eventually, the same sstables. That's why it's possible to
retrieve all posts with that user_id, as they are store contiguously
retrieve a slice of them by doing a ranged query on post_id
retrieve a single post by supplying both the partition column(user_id) and the clustering column (post_id)
In effect, this become a hashmap of a hashmap lookup. The one major caveat, though, is that when using partition and clustering columns, you always need to supply all columns from left to right in your query, without skipping any. So in this case, that means you can't retrieve an individual post without knowing the user_id that the post_id belongs to. That is addressable in user-code(by storing a reverse mapping and doing the lookup when necessary, or by encoding the user_id into the post_id that is passed around your application), but is definitely something to take into consideration.

Resources