How can I get an item by attribute in DynamoDB in Node.js?

I have 7 columns:
id, firstname, lastname, email, phone, createAt, updatedAt
I am trying to write an API in Node.js to get items by phone.
id is the primary key.
I am trying to get the data by phone or email. I haven't created a sort key or GSI yet.
I ended up getting suggestions to use Scan with filters in DynamoDB and fetch all records.
Is there any other way to achieve this?

Your question already contains two good answers:
The slow way is to use a Scan with a FilterExpression to find the matching items. This takes the time (and also the cost!) of reading the entire table on every query. It only makes sense if these queries are very infrequent.
If these queries by phone are not super-rare, it is better to be prepared in advance: add a GSI with phone as its partition key, to allow looking up items by phone value using a Query with IndexName and KeyConditionExpression. These queries will be fast and cheap: you only pay for the items actually retrieved. The downside of this approach is the increased write cost: the cost of every write doubles (DynamoDB writes to both the base table and the index), and the cost of storage increases as well. But unless your workload is write-mostly (items are very frequently updated and very rarely read), option 2, using a GSI, is still better than option 1, a full-table Scan.
Finally, another option is to reconsider your data model. For example, if you always look up items by phone and never by id, you can make phone the partition key of your table and id the sort key (to allow multiple items with the same phone). But I don't know if this is relevant for your use case. If you need to look up items sometimes by id and sometimes by phone, a GSI is probably exactly what you need. A sketch of both options in Node.js follows.
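As a rough illustration, here is a minimal sketch of both options in Node.js using the AWS SDK for JavaScript v3. The table name users and the index name phone-index are assumptions for the example, not details from your question:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Option 1: Scan + FilterExpression (reads, and bills, the whole table)
async function findByPhoneScan(phone: string) {
  const result = await ddb.send(new ScanCommand({
    TableName: "users",                        // assumed table name
    FilterExpression: "phone = :phone",
    ExpressionAttributeValues: { ":phone": phone },
  }));
  return result.Items;
}

// Option 2: Query a GSI whose partition key is phone (fast and cheap)
async function findByPhoneQuery(phone: string) {
  const result = await ddb.send(new QueryCommand({
    TableName: "users",
    IndexName: "phone-index",                  // assumed GSI name
    KeyConditionExpression: "phone = :phone",
    ExpressionAttributeValues: { ":phone": phone },
  }));
  return result.Items;
}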

Related

DynamoDB sorting through data

Everywhere I look, the web is telling me never to use scan() in DynamoDB.
It uses all your capacity units, has a 1 MB response size limit, etc.
I’ve looked at querying, but that doesn’t achieve what I want either.
How am I supposed to parse through my table?
Here is my setup-
I have a table “people” with rows of people.
I have attributes “email” (partition key), “fName”, “lName”, “displayName”, “passwordHash”, and “subscribed”.
subscribed is either true or false, and I need to sort through every person who is subscribed.
I can’t use a sort key because all emails are unique…
It is my understanding that DynamoDB data is sorted as follows:
primary key 1
    sort key 1
        Item 1
    sort key 2
        Item 2
primary key 2
    sort key 1
...etc...
So setting subscribed as a sort key would not work... I would still need to loop through every primary key.
Right now I am just getting every item with a FilterExpression to check whether someone is subscribed.
If they are, they pass. But what happens when I have hundreds of users, whose data eclipses 1 MB?
I wouldn't get every subscribed user in that case, and sending repeated requests with the start key to fetch each additional MB of data is too much work for the processor and would slow the server down significantly.
Are there any recommendations for how I should go about getting every subscribed user?
Note: subscribed cannot be the primary key with email as a sort key, because I have instances where I need just one user, which is easy to access if email is the primary key.
Right now I am just getting every item with a FilterExpression to check whether someone is subscribed. If they are, they pass. But what happens when I have hundreds of users, whose data eclipses 1 MB?
GetItem for single person lookups
You should ideally be using a GetItem here, providing the user's email as the key, and then checking whether they are subscribed. Scanning to see if an individual is subscribed is not scalable in any way.
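As a sketch, a point lookup with the AWS SDK for JavaScript v3 could look like this (the table name people and the attribute names come from the question; everything else is illustrative):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Single-item read by partition key, then a local check of the attribute
async function isSubscribed(email: string): Promise<boolean> {
  const { Item } = await ddb.send(new GetCommand({
    TableName: "people",
    Key: { email },
  }));
  return Item?.subscribed === true;
}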
Pagination
When the response data exceeds 1 MB, you simply paginate:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
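The pattern is simply to keep feeding LastEvaluatedKey back in as ExclusiveStartKey until it comes back empty. A minimal sketch, assuming the AWS SDK for JavaScript v3:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function scanAll(tableName: string) {
  const items: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;
  do {
    const page = await ddb.send(new ScanCommand({
      TableName: tableName,
      ExclusiveStartKey: lastKey,    // undefined on the first request
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey; // undefined once the last page is reached
  } while (lastKey);
  return items;
}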
Are there any recommendations for how I should go about getting every subscribed user?
Sparse Indexes
For this use case it's best to use a sparse index: set subscribed="true" only when it's true; if it's false, don't set the attribute at all (you must also use a string, as a boolean can't be used as a key).
Once you do so, you can create a GSI on the subscribed attribute. Now only the items where it is true are contained in your GSI, making it sparse, so a Scan of that index is as efficient as possible, albeit it will limit throughput capacity to 1000 WCU.
Making things scalable
An even better way is to create an attribute called GSI_PK and assign it a random number, then use subscribed as the sort key, again as a string and only when true. This means your index will not become a bottleneck that limits your throughput to 1000 WCU due to a single value being the partition key.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
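A write-side sketch of this sparse, sharded index, assuming the AWS SDK for JavaScript v3; the attribute names subscribed and GSI_PK and the shard count of 10 are illustrative choices, and the GSI itself would be created with GSI_PK as partition key and subscribed as sort key:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const SHARDS = 10; // spreads index writes across 10 partitions

async function putPerson(person: { email: string; subscribed: boolean }) {
  const item: Record<string, any> = {
    email: person.email,
    GSI_PK: Math.floor(Math.random() * SHARDS),
  };
  if (person.subscribed) {
    item.subscribed = "true"; // set only when true, and as a string, so the index stays sparse
  }
  await ddb.send(new PutCommand({ TableName: "people", Item: item }));
}

Reading every subscribed user then means one Query per shard value (0 through 9) against the GSI, with the results merged client-side.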

Are client-side joins permissible in Cassandra if the client drills down on a datapoint?

I have this structure with about 1000 data points in a list on the website:
Datapoint1:
Datapoint2:
...
Datapoint1000:
With each datapoint containing 6 fields of information.
Each datapoint can be opened to reveal an additional 2-3 times as much information in a sublist.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra? Should I just go ahead and get it all in one go?
Should I just go ahead and get it all in one go?
Definitely not.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra?
That's absolutely the way you should do it. Cassandra is great at writing large amounts of data, but not so great at returning large amounts of data. More, smaller, key-based queries are definitely the way to go.
It is possible to do the JOINs on the client side, but as a general proposition, queries which require joins indicate that you possibly didn't design the data model correctly.
You need to model your data such that each application query maps to a single table. If you need to do a client-side JOIN, then you need to query the database multiple times to get the data required by your app. It will work, but it's not efficient, so it affects the performance of both the app and the database.
To illustrate with an example, let's say your app needs to display a customer's list of orders. The table would need to be partitioned by customer ID, with the orders as multiple clustered rows:
CREATE TABLE orders_by_customerid (
    customerid text,
    orderid text,
    orderdate timestamp,
    ordertotal decimal,
    ...
    PRIMARY KEY (customerid, orderid)
)
You would retrieve the list of orders for a customer with:
SELECT ... FROM orders_by_customerid WHERE customerid = ?
By default, the driver or Stargate API your app is using would page the results so only the first 100 rows (for example) will be returned instead of retrieving thousands of rows in a single pass. Note that the page size is configurable. Cheers!
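For example, with the Node.js cassandra-driver, the paging described above looks roughly like this (contact points and keyspace are placeholders):

import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "my_keyspace",
});

async function firstPageOfOrders(customerId: string) {
  const result = await client.execute(
    "SELECT * FROM orders_by_customerid WHERE customerid = ?",
    [customerId],
    { prepare: true, fetchSize: 100 } // one page of at most 100 rows
  );
  // Hand result.pageState back on the next call to fetch the following page
  return { rows: result.rows, pageState: result.pageState };
}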

Cassandra changing Primary Key vs Firing multiple select queries

I have a table that stores the list of products that a user has. The table looks like this:
CREATE TABLE my_keyspace.userproducts (
    userid text,
    username text,
    productid text,
    productname text,
    producttype text,
    PRIMARY KEY (userid)
)
All users belong to a group; there can be from 1 to at most 100 users in a group:
userid | groupid | groupname
1      | g1      | grp1
2      | g2      | grp2
3      | g3      | grp3
We have a new requirement to display all products for all users in a single group.
So do I change userproducts so that my partition key is now groupid, and make userid my clustering key, so that I get all my results in one single query?
Or do I keep my table design as it is and fire multiple SELECT queries: first select all users in a group from the second table, then fire one SELECT per user, consolidate the data in my code, and return it to the users?
Thanks.
Even before getting to your question, your data modelling as you presented it has a problem: you say that you want to store "a list of products that a user has", but that is not what the table you presented stores. Your table has a single product for each userid: userid is the key of your table, and each entry in the table, i.e., each unique userid, has one combination of the other fields.
If you really want each user to have a list of products, you need the primary key to be (userid, productid). This means that each record is indexed by both a userid and a productid, or in other words, a userid has a list of records, each with its own productid. Cassandra lets you efficiently fetch all the productid records for a single userid because the first part of the key is the partition key and the second part is a clustering key. A sketch follows below.
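As a sketch (column types are assumptions), the corrected table and the single-partition read it enables would look like this with the Node.js cassandra-driver:

import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "my_keyspace",
});

// Corrected schema: one row per (userid, productid)
// CREATE TABLE my_keyspace.userproducts (
//   userid text, username text, productid text,
//   productname text, producttype text,
//   PRIMARY KEY (userid, productid));

async function listProducts(userid: string) {
  // Reads a single partition; clustering on productid keeps the list ordered
  const { rows } = await client.execute(
    "SELECT productid, productname FROM userproducts WHERE userid = ?",
    [userid],
    { prepare: true }
  );
  return rows;
}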
Regarding your actual question, you indeed have two options: Either do multiple queries on your original tables, or do so-called denormalization, i.e., create a second table with exactly what you want searchable immediately. For the second option you can either do it manually (update both tables every time you have new data), or let Cassandra update the second table for you automatically, using a feature called Materialized Views.
Which of the two options - multiple queries or multiple updates - to use really depends on your workload. If it has many updates and rare queries, it is better to leave updates quick and make queries slower. If, on the other hand, it has few updates but many queries, it is better to make updates slower (each update needs to write to both tables) but make queries faster. Another important issue is how much query latency matters to you: the multiple-queries option not only increases the load on the cluster (which you can solve by throwing more hardware at the problem) but also increases latency, a problem which does not go away with more hardware and which for some use cases may be unacceptable.
You can also achieve a similar goal in Cassandra by using the Secondary Index feature, which has its own performance characteristics (in some respects it is similar to the "multiple queries" solution).
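To make the denormalization option concrete, here is a sketch of a manually maintained second table partitioned by group; the table and column names are assumptions:

import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "my_keyspace",
});

// Second, denormalized table (created once):
// CREATE TABLE products_by_group (
//   groupid text, userid text, productid text, productname text,
//   PRIMARY KEY (groupid, userid, productid));

async function addProduct(groupid: string, userid: string, productid: string, productname: string) {
  // Manual denormalization: every write goes to both tables
  await client.execute(
    "INSERT INTO userproducts (userid, productid, productname) VALUES (?, ?, ?)",
    [userid, productid, productname],
    { prepare: true });
  await client.execute(
    "INSERT INTO products_by_group (groupid, userid, productid, productname) VALUES (?, ?, ?, ?)",
    [groupid, userid, productid, productname],
    { prepare: true });
}

// "All products for all users in a group" is now a single query:
// SELECT * FROM products_by_group WHERE groupid = ?;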

Regarding Azure table design

I am working as a freelancer, and right now I am working on one of my games, trying to use the Azure Table service to log my users' moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users (UserId) will be playing at a table (TableId). Each game at the table will have a unique GameId. In each game there can be multiple deals, each with a unique DealId.
There can be multiple deals at the same table with the same GameId. Also, each user will have the same DealId within a single game.
A winner is decided after a player has had multiple chances.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId | GameId | DealId | UserId | timestamp
1       | 201    | 300    | 12345  |
1       | 201    | 300    | 12567  |
Maybe what I can do is create 4 Azure tables as below, but then I am doing a lot of duplication; also, I would not be able to fire a point query as mentioned in the guidelines at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
TableId, GameId, DealId, and UserId are all of type long.
I would like to query data such that:
1. Get me all the logs for a TableId.
2. Get me all the logs for a TableId in a particular game (GameId).
3. Get me all the logs of a user (UserId) in a game (GameId).
4. Get me all the logs of a user in a deal (DealId).
5. Get me all the logs from a table on a given date; similarly for a user, game, and deal.
Based on my knowledge of Azure Tables so far, I believe you're on the right track.
However, there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need separate tables for each kind of data, though that approach separates the data logically and nicely. If you want, you can store everything in a single table. If you go with a single table, since these ids (Game, Table, User, and Deal) are numbers, I would recommend prefixing each value so you can identify it. For example, when a PartitionKey denotes a Game Id, prefix the value with G| so that you know it's a Game Id, e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are long. However, the PartitionKey value is of string type. I would recommend pre-padding the values so that they are of equal length. For example, when storing a Game Id as PartitionKey, instead of storing 1, 2, 103, etc., store 00000000001, 00000000002, 00000000103. This way, when you list all Ids, they will be sorted in proper numeric order. Without pre-padding, you will get string-sorted results like 1, 10, 11, ... 19, 2, 20.
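A one-liner sketch of the padding (11 digits here only to match the example above):

// Zero-pad ids so that string ordering matches numeric ordering
const toKey = (id: number | bigint): string => id.toString().padStart(11, "0");

toKey(1);   // "00000000001"
toKey(103); // "00000000103"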
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the entity group transactions available in Azure Tables (they only work within a single partition of a single table), and each insert has to be done as a separate operation. Since each operation is a network call that can fail, you may want to do the inserts through an idempotent background process which keeps retrying until the data is inserted into all the tables.
Instead of Guid for RowKey, I suggest you create a composite RowKey based on other values
This is more applicable to the update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey composed of the other values. For example, if you're using TableId as the PartitionKey for GameLogsByTableId, create the RowKey as U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you can reconstruct the RowKey instead of having to fetch the data from the table first.
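A sketch of this with the @azure/data-tables package; the connection string, the table name, and the 11-digit padding helper are assumptions:

import { TableClient } from "@azure/data-tables";

const table = TableClient.fromConnectionString(
  process.env.STORAGE_CONN!, // assumed environment variable
  "GameLogsByTableId"
);

const pad = (id: number) => id.toString().padStart(11, "0");

// The RowKey is rebuilt from ids you already have (no prior read needed to locate the row)
function rowKeyFor(userId: number, dealId: number, gameId: number): string {
  return `U|${pad(userId)}|D|${pad(dealId)}|G|${pad(gameId)}`;
}

async function getLog(tableId: number, userId: number, dealId: number, gameId: number) {
  // Point query: both PartitionKey and RowKey are fully specified
  return table.getEntity(`T|${pad(tableId)}`, rowKeyFor(userId, dealId, gameId));
}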
Partition Scans
I looked at your querying requirements, and almost all of them would result in partition scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: you would need to scan a user's entire partition to find information about a Game Id or Deal Id. So be prepared for the scenario where the table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using a SQL database: you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large amounts of data.

Storing a list in Cassandra

I want to save a friends list in Cassandra, where a user may have a few hundred friends. Should I store the list of friends (each of which is an email id) as a list or set in Cassandra, or should I create a separate table with columns user_id and friend, which will contain all users (millions of them) along with their friends?
If I create a separate table with user_id and friend columns, will there be a degradation in performance while retrieving the entire friends list of a user, or one friend of a user, since the table will contain many records/rows?
It is important to note that lists and sets in Cassandra are not pageable: when you query for them, you get back the whole list or the whole set at once. If the collection has a high cardinality, this can cause problems such as read timeouts or even a heap OOM error on the node.
Since it sounds like there is no cap on the amount of friends one can have, one option could be to have a separate table that is partitioned on user and clustered on friend.
CREATE TABLE user_friends (
    owner_user_id int,
    friend_user_id int,
    PRIMARY KEY (owner_user_id, friend_user_id)
);
This ensures that the friend_user_id values are stored in order and allows you to do client-side paging if the number of friends is very large. It also gives you a quick way to check whether a person is a friend of a user.
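A sketch of both access patterns with the Node.js cassandra-driver (contact points and keyspace are placeholders):

import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "my_keyspace",
});

// Page through a possibly large friend list, 100 rows at a time
async function friendsPage(ownerId: number, pageState?: string) {
  return client.execute(
    "SELECT friend_user_id FROM user_friends WHERE owner_user_id = ?",
    [ownerId],
    { prepare: true, fetchSize: 100, pageState });
}

// Quick membership check: both key parts specified, single-row read
async function isFriend(ownerId: number, friendId: number): Promise<boolean> {
  const { rows } = await client.execute(
    "SELECT friend_user_id FROM user_friends WHERE owner_user_id = ? AND friend_user_id = ?",
    [ownerId, friendId],
    { prepare: true });
  return rows.length > 0;
}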
