DynamoDB sorting through data - node.js

Everywhere I look, the web is telling me to never use scan() in DynamoDB.
It consumes all your capacity units, responses are capped at 1 MB, etc.
I’ve looked at querying, but that doesn’t achieve what I want either.
How am I supposed to parse through my table?
Here is my setup-
I have a table “people” with rows of people.
I have attributes “email” (partition key), “fName”, “lName”, “displayName”, “passwordHash”, and “subscribed”.
subscribed is either true or false, and I need to sort through every person who is subscribed.
I can’t use a sort key because all emails are unique…
It is my understanding that DynamoDB data is sorted as follows:
primary key 1
    sort key 1
        Item 1
    sort key 2
        Item 2
primary key 2
    sort key 1
    ..etc..
So setting subscribed as a sort key would not work… I would still need to loop through every primary key.
Right now I am just getting every item with a FilterExpression to check if someone is subscribed.
If they are, they pass. But what happens when I have hundreds of users, whose data exceeds 1 MB?
I wouldn't get every user that is subscribed in that case, and sending repeated requests with the exclusive start key to fetch each additional MB of data is too tedious for the processor and would slow the server down significantly.
Are there any recommendations for how I should go about getting every subscribed user?
Note: subscribed cannot be the primary key with the email as a sort key, because I have cases where I need just one user, which is easy to access if the email is the primary key.

Right now I am just getting every item with a FilterExpression to check if someone is subscribed. If they are, they pass. But what happens when I have hundreds of users, whose data exceeds 1 MB?
GetItem for single person lookups
You should ideally be using a GetItem here, providing the user's email as the key and then checking whether they are subscribed or not. Scanning to see if an individual is subscribed is not scalable in any way.
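A minimal sketch of that lookup with the AWS SDK for JavaScript v3 document client (table and attribute names are taken from the question):

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, GetCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch one user by partition key and check the subscription flag.
async function isSubscribed(email) {
  const { Item } = await ddb.send(new GetCommand({
    TableName: "people",
    Key: { email },
  }));
  return Item ? Item.subscribed === true : false;
}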
Pagination
When a response exceeds the 1 MB limit you simply paginate, passing the returned LastEvaluatedKey as the next request's ExclusiveStartKey:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
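For example, a sketch of that pagination loop (reusing the ddb document client from the GetItem sketch above), carrying LastEvaluatedKey forward until it comes back undefined:

const { ScanCommand } = require("@aws-sdk/lib-dynamodb");

async function scanSubscribed() {
  const items = [];
  let ExclusiveStartKey;
  do {
    const page = await ddb.send(new ScanCommand({
      TableName: "people",
      FilterExpression: "subscribed = :t",
      ExpressionAttributeValues: { ":t": true },
      ExclusiveStartKey,
    }));
    items.push(...page.Items);
    ExclusiveStartKey = page.LastEvaluatedKey; // undefined on the last page
  } while (ExclusiveStartKey);
  return items;
}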
Are there any recommendations for how I should go about getting every subscribed user?
Sparse Indexes
For this use case it's best to use a sparse index: set subscribed="true" only when it is true, and don't set the attribute at all when it is false (it must also be a string, as a boolean can't be used as a key).
Once you do so, you can create a GSI on the attribute subscribed. Only the items where it is set are contained in your GSI, making it sparse, so reading that index is now as efficient as possible, although a single-valued partition key will limit throughput to 1000 WCU.
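As a sketch, with a GSI named subscribed-index (the index name is an assumption) and the ddb client from above, reading every subscribed user becomes a paginated Query on the single key value:

const { QueryCommand } = require("@aws-sdk/lib-dynamodb");

async function getSubscribedUsers() {
  const items = [];
  let ExclusiveStartKey;
  do {
    const page = await ddb.send(new QueryCommand({
      TableName: "people",
      IndexName: "subscribed-index",
      KeyConditionExpression: "subscribed = :t",
      ExpressionAttributeValues: { ":t": "true" }, // the sparse string flag, not a boolean
      ExclusiveStartKey,
    }));
    items.push(...page.Items);
    ExclusiveStartKey = page.LastEvaluatedKey;
  } while (ExclusiveStartKey);
  return items;
}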
Making things scalable
An even better way is to create an attribute called GSI_PK and assign it a random number, then use subscribed as the sort key, again as a string and only set when true. This means your index will not become a bottleneck that limits your throughput to 1000 WCU due to a single value being the partition key; a sketch follows the link below.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
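A sketch of that write-sharding pattern, assuming 10 shards and an index named subscribed-sharded-index (both assumptions; ddb and QueryCommand as in the sketches above). Writes pick a random shard; reads query every shard and merge the results:

const SHARDS = 10;

// On write: only subscribed users get the shard number and the sparse string flag.
function subscriptionAttributes(isSubscribed) {
  return isSubscribed
    ? { GSI_PK: Math.floor(Math.random() * SHARDS), subscribed: "true" }
    : {};
}

// On read: query every shard of the index and merge the pages.
async function getSubscribedUsersSharded() {
  const items = [];
  for (let shard = 0; shard < SHARDS; shard++) {
    let ExclusiveStartKey;
    do {
      const page = await ddb.send(new QueryCommand({
        TableName: "people",
        IndexName: "subscribed-sharded-index",
        KeyConditionExpression: "GSI_PK = :s AND subscribed = :t",
        ExpressionAttributeValues: { ":s": shard, ":t": "true" },
        ExclusiveStartKey,
      }));
      items.push(...page.Items);
      ExclusiveStartKey = page.LastEvaluatedKey;
    } while (ExclusiveStartKey);
  }
  return items;
}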

Related

Tally unread (chat) messages in database

My goal is to create daily reports for users about chat messages they've missed/not read yet. Right now, all data is stored in ScyllaDB, and that is working out well for the most part. But when it comes to these reports I've no idea whether there is a good way to achieve that without changing the database system.
Thing is, I don't want to query the unread messages for each user. (I could do that because messages have a timeuuid I can compare with a last_read timestamp, but it's slow because it means multiple queries for every single user there is.) Therefore, I tried to create a dedicated table for the reporting:
CREATE TABLE missed_counts (  -- table name assumed
    user uuid,
    channel uuid,
    count_start_time timestamp,
    missed_count int,
    PRIMARY KEY (channel, user)
)
Once a new message in a channel arrives, I can retrieve all users in that channel (from another table). My idea was to increment missed_count, or decrement it in case a message was deleted (and its creation timestamp is > count_start_time; I figure I could achieve that with an IF condition on the update). Once a user reads his messages, I reset count_start_time to the current date and missed_count to 0.
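To illustrate, a sketch of that conditional update with the Node.js cassandra-driver (the table name matches the one assumed above; contact points and keyspace are placeholders):

const cassandra = require("cassandra-driver");
const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],   // placeholder
  localDataCenter: "datacenter1", // placeholder
  keyspace: "chat",               // placeholder
});

// missed_count is not a counter, so increment it optimistically: read the current
// value, then apply the update only if it has not changed in the meantime (LWT).
async function incrementMissedCount(channel, user) {
  const read = await client.execute(
    "SELECT missed_count FROM missed_counts WHERE channel = ? AND user = ?",
    [channel, user], { prepare: true });
  const current = read.first() ? read.first().missed_count : 0;
  const update = await client.execute(
    "UPDATE missed_counts SET missed_count = ? WHERE channel = ? AND user = ? IF missed_count = ?",
    [current + 1, channel, user, current], { prepare: true });
  return update.wasApplied(); // false: a concurrent writer won, retry if needed
}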
But several issues arise here:
Since I can't use a Counter my updates aren't atomic. But I think I could live with that.
For the reasons below it would be ideal if I could just delete a row once messages get read instead of resetting the timestamp and counter. But I've read that many deletions might cause performance issues (and I'm also not sure what would happen if the entry gets recreated after a short period because new messages arrive in the channel again).
The real bummer: since I did not want to iterate over all users on the system in the first place, I don't want to iterate over all entries here either. The naive idea would be to query with WHERE missed_count > 0, but missed_count isn't part of the clustering key, so to my understanding that's not feasible.
Since I have to paginate, it could happen that I get the missed messages for a single user in different chunks. I mean, it could happen that I report to user1 that he has unread messages from channel1 first, and later that he has unread messages from channel2. That means additional overhead in case I want to avoid multiple reports for the same user.
Is there a way I could structure my table to solve that problem, especially how to query only entries with missed_count > 0 or to utilize row deletion? Or is my goal beyond the design of Cassandra/ScyllaDB?
Thanks in advance!

How Can I get Item by Attribute in DynamoDB in nodejs

I have 7 columns,
id, firstname, lastname, email, phone, createAt, updatedAt
I am trying to write an API in Node.js to get items by phone.
id is the primary key.
I am trying to get the data by phone or email. I haven't created a sort key or GSI yet.
I ended up getting suggestions to use a Scan with filters in DynamoDB and get all records.
Is there any other way to achieve it?
Your question already contains two good answers:
The slow way is to use a Scan with a FilterExpression to find the matching items. This will take the time (and also cost!) of reading the entire database on every query. It only makes sense if these queries are very infrequent.
If these queries by phone are not super-rare, it is better to be prepared in advance: add a GSI with the phone as its partition key, to allow looking up items by phone value using a Query with IndexName and KeyConditionExpression. These queries will be fast and cheap: you will only pay for items actually retrieved. The downside of this approach is the increased write costs: the cost of every write doubles (DynamoDB writes to both the base table and the index), and the cost of storage increases as well. But unless your workload is write-mostly (items are very frequently updated and very rarely read), option 2, using a GSI, is still better than option 1 - a full-table Scan.
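A sketch of option 2 in Node.js with the AWS SDK for JavaScript v3; the table name "users" and index name "phone-index" are assumptions:

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { DynamoDBDocumentClient, QueryCommand } = require("@aws-sdk/lib-dynamodb");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query the GSI whose partition key is "phone" instead of scanning the table.
async function getUsersByPhone(phone) {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: "users",
    IndexName: "phone-index",
    KeyConditionExpression: "phone = :p",
    ExpressionAttributeValues: { ":p": phone },
  }));
  return Items;
}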
Finally another option you have is to reconsider your data model. For example, if you always look up items by phone and never by id, you can make phone the partition key of your data and id the sort key (to allow multiple items with the same phone). But I don't know if this is relevant for your use case. If you need to look up items sometimes by id and sometimes by phone, probably GSI is exactly what you need.

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal, which is great because it sounds like that's what you will be doing a lot of. If the partition key has no logical meaning (i.e., you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit a timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
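For instance, a sketch with the @azure/data-tables client for Node.js, where the point read goes straight to one partition and the listing iterator follows continuation tokens for you (the connection string and table name are placeholders):

const { TableClient } = require("@azure/data-tables");

const client = TableClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING,
  "items"
);

// Point query: partition key + row key identifies exactly one entity.
async function getItem(partitionKey, rowKey) {
  return client.getEntity(partitionKey, rowKey);
}

// Full listing: each iteration step transparently requests the next page when needed.
async function listAll() {
  const entities = [];
  for await (const entity of client.listEntities()) {
    entities.push(entity);
  }
  return entities;
}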
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are a good option for you.
The table service auto-scale feature may not work perfectly, though. When some of the data in a partition is 'hot', the table service will move it to another cluster to enhance performance. But since you have unique partition keys, probably none of your entities will be determined to be 'hot', whereas if you grouped them into partitions, some partitions would become 'hot' and be moved. The problem below may also apply if you are using a static partition key.
Besides, the table service may return partial results for your query when:
More than 1000 entities are in the result.
A partition boundary is crossed.
From your requirements you also need a full query (return all entities). If you are using a unique partition key, this means each entity is its own partition, so your query will only return 1 entity with a continuation token, and you need to fire another query with that continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is to select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.

Cassandra schema advice needed

I'm designing a Cassandra schema for a browser event collection system, and I was hoping to sanity check my approach. The system collects user events in the browser, like mouse movements, clicks, etc. The events are stored and processed to create heat maps of user activity on a web page. I've chosen Cassandra for persistence, since my use case is more write heavy than read heavy: every 50 milliseconds, an ajax call dumps the aggregated events to my server, and into the database. I'm using node.js for the server, and the JSON events look something like this on the server:
{ uuid: 'dsf86ag487hadf97hadf97', type: 'MOVE', time: 12335234345, pageX: 334, pageY: 566, .... }
As you can see each user has a unique uuid, associated with each of their events, generated on the browser, stored in a cookie. My read case will be some map-reduce job. Each top-level domain will be a keyspace, and I was planning using the uuid as my partition key. The main table will be the events table, where each row will be one event, using a composite primary key, consisting of the browser-generated uuid and a cassandra-generated timeuuid. The primary key must have a timeuuid component, since two events may have the same timestamp on certain browsers. The data types for event will be strings, ints, timestamps. The total data for a partition should not exceed a few hundred megabytes. So...Is this sane? What questions should I be asking myself? I recognize that this use case has many analogs in sensor data collection, etc, so please point me to existing examples. Thanks in advance.
Choosing a partition key
While recording the user ID may be important in some cases for distinguishing events from different users that may occur at the same time, the user ID is probably not the best choice for the partition key. That is, unless you are planning to analyze the behavior of specific users.
You are probably more concerned with how the heatmap changes over time and specifically which areas of the page were involved. These are probably better considerations for your partition key, though perhaps not stored as a timestamp nor as X/Y coordinates, which I'll get into later.
You will generally want to choose a partition key that has (1) a large distribution of values, to create even load across your cluster, and (2) is made up of values that are relatively "well known". By "well known", I mean something you either know in advance or something that can be computed easily and deterministically. For instance, you will have many users and will gather statistics over many days. While the specific days (encoded as, say, YYYY-MM-DD strings) can be easily determined based on a known start/end date range or query input, the set of all valid user IDs (assuming UUIDs or other non-incremental values, or hashes) is much harder to determine without doing a scan of the entire cluster. Avoid doing partition key scans; aim for "exact" random access to your partitions.
Format of the partition key
The partition key is traditionally shown as a single column in many examples, but you can have a multi-column partition key. This can be useful when using date/time information as all or part of the key. You would aim to have as few unique values per column as possible, so that the set of values you need to enumerate is as small as possible, but as many values (or additional columns) as necessary to balance the I/O load and data distribution across the cluster.
For example, if you were to use a timestamp as your partition key, in 64-bit Java timestamp format, there are 1,000 possible partitions per second. Even though you can technically iterate over them, that may be more granular than you need or want. On the other side, if your partition key were simply the 4-digit year, then all of that year's events would go to the same partition (making it very large) and to the same set of replica nodes (hotspots, inefficient cluster use). By choosing a key that balances between these extremes, you can control the size of your partitions and also the number of partitions you must access in order to satisfy a query.
Also consider what you'll do when you ever want to delete old data. The easiest means (within a single column family/table) is to delete an entire partition as this helps avoid accumulating individual column tombstones. If you ever want to run an operation like "delete all data older than 2013" then you definitely don't want to bury the date deep down in the data and would rather have it as part of your partition key.
Choosing a row (clustering) key
Any additional columns in the primary key that are not part of the partition key become the row key within the partition, and the rows are clustered (ordered) by the sort order of the first of these columns.
That clustering/sorting is important, because it's generally the only native sorting you're going to get with Cassandra. Even if the partition key is down to the level of a specific hour or minute of a specific day, you might choose to cluster the rows by your millisecond timestamp or time UUID, to keep everything within that partition in chronological order.
You can still have additional columns, like your X/Y coordinates or user IDs, in your row keys -- in case it sounded like I was recommending that you put time (only) in both the partition and clustering keys.
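To make that concrete, here is a sketch of one possible layout, created from Node.js with cassandra-driver; the keyspace, table, and column names are assumptions, not the asker's schema. The partition key combines a page identifier with a day bucket, and a timeuuid clustering column keeps events in chronological order:

const cassandra = require("cassandra-driver");
const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],   // placeholder
  localDataCenter: "datacenter1", // placeholder
  keyspace: "analytics",          // placeholder
});

const createEvents = `
  CREATE TABLE IF NOT EXISTS events (
    page        text,      -- page (or DOM element) identifier
    day         text,      -- 'YYYY-MM-DD' bucket
    event_time  timeuuid,  -- clustering column: rows stay in chronological order
    user_id     uuid,
    event_type  text,
    page_x      int,
    page_y      int,
    PRIMARY KEY ((page, day), event_time)
  )`;

// Partition = (page, day): easy to enumerate for a known date range, small enough to
// avoid hotspots, and a whole day can be removed by deleting a single partition.
async function createSchema() {
  await client.execute(createEvents);
}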
Using X/Y coordinates
This part has nothing to do with Cassandra, but if you're heat-mapping the page, do be aware that people use different screens and devices at different resolutions. Unless you're doing pixel-perfect layout on your site (and hopefully you're using a fluid, responsive layout instead) then the X/Y coordinate of one user isn't going to match the X/Y coordinates from another user. They might not even match for the same user, if that user switches devices.
Consider mapping not by X/Y coordinate of the mouse, but perhaps the IDs of elements in the DOM. Have an ID for your "sidebar", "main menu", "main body div" and any specific elements you want to map. These would be string keys, not coordinate pairs, and while they'd still be triggered on mouse enter/leave/click the logged information doesn't depend or assume any particular screen geometry.
Perhaps you decide to include the element ID as part of the row or partition key, too.
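A browser-side sketch of that element-based logging; the /events endpoint, the buffering, and the 50 ms flush interval are assumptions modeled on the question:

const pendingEvents = [];

document.addEventListener("click", (e) => {
  const region = e.target.closest("[id]"); // nearest ancestor that has an id
  pendingEvents.push({
    type: "CLICK",
    element: region ? region.id : "unknown",
    time: Date.now(),
  });
});

// Flush the buffer to the server on the same cadence as the ajax dump in the question.
setInterval(() => {
  if (pendingEvents.length === 0) return;
  fetch("/events", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(pendingEvents.splice(0)),
  });
}, 50);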

Rankings in Azure Table

I am just stuck on a design problem. I want to assign ranks to user records in a table. They do some actions on the site and are given a rank on the basis of a leaderboard. And the selects I want on them could be Top 10, a user's position, Top 10 logged in today, etc.
I just cannot find a way to store it in an Azure table. Then I thought about storing a custom collection object (a sorted list) in a blob.
Any suggestions?
Table entities are sorted by PartitionKey, RowKey. While you could continually delete and recreate users (thus allowing you to change the PK, RK) to give the correct order, it seems like a bad idea or at least overkill. Instead, I would probably store the data that you use to compute the rankings and periodically compute and store the rankings (as you say). We do this a lot in our work - pre-compute what the data should look like in a JSON view, store it in a blob, and let the UI query it directly. The trick is to decide when to re-compute the view. After a user does something that would cause the rankings to be re-computed, I would probably queue a message and let a worker process go and re-compute the view. This prevents too many workers from trying to update the data at once.
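A sketch of that flow with the @azure/storage-blob and @azure/storage-queue clients for Node.js; the container, queue, and blob names, and the computeRankings() helper, are assumptions:

const { BlobServiceClient } = require("@azure/storage-blob");
const { QueueServiceClient } = require("@azure/storage-queue");

const conn = process.env.AZURE_STORAGE_CONNECTION_STRING;
const rankingsBlob = BlobServiceClient.fromConnectionString(conn)
  .getContainerClient("views")
  .getBlockBlobClient("leaderboard.json");
const rankingsQueue = QueueServiceClient.fromConnectionString(conn)
  .getQueueClient("recompute-rankings");

// Call this whenever a user action might change the rankings.
async function requestRecompute(userId) {
  await rankingsQueue.sendMessage(JSON.stringify({ userId }));
}

// Worker side, triggered by the queue message: rebuild the JSON view and overwrite the blob.
async function recomputeView() {
  const rankings = await computeRankings();
  const json = JSON.stringify(rankings);
  await rankingsBlob.upload(json, Buffer.byteLength(json), {
    blobHTTPHeaders: { blobContentType: "application/json" },
  });
}

// Assumed helper: read the raw score data from the table and sort it into rankings.
async function computeRankings() {
  return []; // placeholder
}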
