I am using DynamoDB for my Alexa skill. In the documentation for DynamoDB, it says that the primary key (and any secondary indexes) has to be one of three types: binary, string, or number. I was wondering if there is a way to search the database using an array, or things like "tags", to try and match an item in the database with the most matching "tags" used to search the items. If this is not possible with DynamoDB, are there other databases that allow this functionality? Otherwise, what kind of service could I use (besides a database) that would allow me to do this kind of querying?
DynamoDB has been designed for fast reads/writes and huge scale. The best way to use DynamoDB is to dump system-of-record data into it and then access whole objects by some id.
Some compromises have been made to deliver that speed, and one of them is complex queries. For your use case, I think Elasticsearch is the best option.
With DynamoDB you could achieve this by having your primary key composed of two parts:
a partition key (some unique id for each item), and
a sort key (the specific tag)
This leads to storing duplicate data, since you'd need to store the item data once per tag in order to allow fast queries by key.
The structure would be something like this:
Partition (ID) | Sort (Tag) | other attributes
1234           | node.js    | { timestamp: "...", message: "...", ... }
1234           | database   | { timestamp: "...", message: "...", ... }
1234           | alexa      | { timestamp: "...", message: "...", ... }
Note that the partition key (ID) is the same for each row, but the sort key (Tag) changes. The other attributes can be anything you like, but in this case they are duplicated. Other items would be added in a similar manner, with their unique id as the partition key and their tags as sort keys, one row per tag.
This model is really optimized for fast reads. When a tag is deleted from an item, you'd delete the corresponding row.
But when some data in the item changes, the message attribute for example, you'd need to update every row, resulting in multiple writes. Also, those writes would not be atomic, so some data could end up stale.
Of course, whether this is a valid approach depends on what other data query needs your application has and on the ratio of reads to writes you'll have.
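To make the duplicated-write pattern concrete, here is a minimal sketch using the AWS SDK for Java document API. The table name "items" and the attribute names are assumptions for illustration, not your actual schema.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class TaggedItemWriter {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Table items = dynamo.getTable("items"); // assumed table: hash "ID", range "Tag"

        String itemId = "1234";
        String[] tags = {"node.js", "database", "alexa"};

        // One physical row per tag; the non-key attributes are duplicated,
        // which is exactly the trade-off discussed above.
        for (String tag : tags) {
            items.putItem(new Item()
                    .withPrimaryKey("ID", itemId, "Tag", tag)
                    .withString("message", "...")
                    .withLong("timestamp", System.currentTimeMillis()));
        }
    }
}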
I'm currently developing a skill which will be a Stock portfolio manager.
I want the table to have three headings
UserID (which comes from Alexa)
Stock name
Stock amount
UserID is the primary key. Currently I can add an amount of a stock to the table, and then in a separate method called GetPortfolioValue I query the database to return the stock name and stock amount for a specific UserID, which I then do some maths on to return the portfolio value.
The problem is that I can't add another entry under the same primary key, so the portfolio can only hold one stock, which sucks.
I want to later be able to edit the portfolio in case a user sells part of their stocks.
Is there a way I can do this using DynamoDB?
Primary keys must be unique. You probably need a compound primary key of UserID + Stock Name. In DynamoDB that is accomplished by setting UserID to the hash key, and Stock Name to the sort key, on the primary table index.
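As a hedged sketch (not your exact code), writing one row per holding under such a compound key with the AWS SDK for Java document API could look like this; the table and attribute names are illustrative:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class PortfolioWriter {
    public static void main(String[] args) {
        DynamoDB dynamo = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        // Assumed table: hash key "UserID" (string), sort key "StockName" (string).
        Table portfolio = dynamo.getTable("portfolio");

        // Each (UserID, StockName) pair is unique, so one user can hold many stocks.
        portfolio.putItem(new Item()
                .withPrimaryKey("UserID", "amzn1.ask.account.XXX", "StockName", "AMZN")
                .withNumber("StockAmount", 45));
    }
}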
You can have several items with the same partition key when the table has a sort key. If the table has only a partition key and no sort key, then no. If the table has a sort key, then each partition key + sort key combination must be unique.
In such a table, the PK attribute is the partition key and SK is the sort key.
Note that items with the same partition key are placed physically together, so there shouldn't be partition keys like "active" & "not_active": with only 2 possible values, all rows will fall into only 2 partitions. If you have a lot of rows with the same partition key, you will create a hot spot for queries and may experience slow queries. But how to design a DynamoDB table is way too broad a topic to cover here.
You will probably benefit from my article: https://lukasliesis.medium.com/learning-from-aws-fcb0cc71926b
There are two brilliant videos by Rick Houlihan mentioned on DynamoDB Architecture:
https://www.youtube.com/watch?v=HaEPXoXVf2k
https://www.youtube.com/watch?v=6yqfmXiZTlM
I highly recommend watching both multiple times to get into the DynamoDB mindset, understand query heat maps, how to fit a whole app into a single table, and how to use DynamoDB to its best.
Update: Adaptive Capacity
This is probably the most common issue DynamoDB once had.
If you’ve used DynamoDB before, you’re probably aware that DynamoDB recommends building your application to deliver evenly distributed traffic for optimal performance. That is, your requests should be evenly distributed across your primary key. This is because before adaptive capacity, DynamoDB allocated read and write throughput evenly across partitions. For example, if you had a table capable of 400 writes per second (in other words, 400 write capacity units, or “WCUs”) distributed across four partitions, each partition would be allocated 100 WCUs. If you had a nonuniform workload with one partition receiving more than 100 writes per second (a hot partition), those requests might have returned a ProvisionedThroughputExceededException error.
Read full article here:
https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Yes, assuming you're using the userId from the Alexa requests, it will provide an excellent primary key for your DynamoDB table.
"user": {
"userId": "amzn1.ask.account.[unique-value-here]",
Simply store the stocks as a JSON object against each user.
{
    "stocks": [
        {
            "stockAmount": 45,
            "stockName": "AMZN"
        }
    ]
}
Storing the Object
DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");

// The JSON document from above, as an escaped Java string literal.
String json = "{\"stocks\": [{\"stockAmount\": 45, \"stockName\": \"AMZN\"}]}";

Item item = new Item()
        .withPrimaryKey("alexa_id", 1)
        .withJSON("document", json);

table.putItem(item);
and how to get it back
DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient(...));
Table table = dynamo.getTable("stocks");
Item documentItem = table.getItem(new GetItemSpec()
        .withPrimaryKey("alexa_id", 1)
        .withAttributesToGet("document"));

System.out.println(documentItem.getJSONPretty("document"));
As your users add or remove stocks you'll need to append to the stocks array for the user. You'll be subject to a 400 KB limit for each of your DynamoDB items; the size of a given item includes the attribute names (in UTF-8) and the attribute values.
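For the append itself, one hedged option is an update expression with list_append, reusing the table variable from the snippet above; the new stock values are placeholders:

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

// Append one stock to the "stocks" list inside the "document" attribute.
Map<String, Object> newStock = new HashMap<>();
newStock.put("stockAmount", 10);
newStock.put("stockName", "GOOG");

table.updateItem(new UpdateItemSpec()
        .withPrimaryKey("alexa_id", 1)
        .withUpdateExpression("SET #d.stocks = list_append(#d.stocks, :s)")
        .withNameMap(new NameMap().with("#d", "document"))
        .withValueMap(new ValueMap().withList(":s", newStock)));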
DocumentDB is truly schemaless; this implies it will not force you to have only a certain kind of document, and that choice is left to the user.
Considering the above point, in a partitioned collection (let's say department as the partition key in an Employee document), suppose the user decides not to pass any partition key (just a use case to support my point: a new employee/intern has joined, the department they will work in hasn't been decided yet, and the document might later be updated with the appropriate department).
Based on the above scenario, my question is: in the interim period, to which partition will the new employee go/persist, as I don't have a department (partition key) for them?
{
    "eid": "",
    "entryType": "",
    "address": {
        "PIN": "",
        "city": "",
        "street": ""
    },
    "name": "",
    "id": "",
    "age": ""
}
Great question! We had the same question when we started working with partitioned collections.
Based on my understanding, it is certainly possible to create a document in a partitioned collection without specifying the partition key attribute (departmentId in your case), though it is not recommended.
When such things happen, Cosmos DB puts such documents in a special partition that is accessible by specifying {}, i.e. an empty JavaScript object, as the partition key in your query.
However, please keep in mind that you can't update the partition key attribute's value (an employee being assigned a department, in your example). You must delete and recreate the document with the correct partition key.
If you're using a partitioned collection then you'll be required to provide a value for partitionKey on every insert. The choice of what to use as a partitionKey is based on the needs of your application and the nature of your data. There is no one size fits all answer. From the sounds of things you might want to reconsider using department as a partitionKey.
I am working as a freelancer, and right now I'm working on one of my games and trying to use the Azure Table service to log my users' moves in Azure tables.
The game is based on Cards.
The flow is like this:
Many users(UserId) will be playing on a table(TableId). Each game on the table will have a unique GameId. In each game there could be multiple deals with Unique DealId.
There can be multiple deals on the same table with the same GameId. Also, each user will have the same DealId in a single game.
Winner is decided after multiple chances of a player.
Problem:
I can make TableId the PartitionKey, but I am not sure what to choose for the RowKey, because the combination of TableId and RowKey (GameId/UserId/DealId) should be unique in the table.
I can have entries like:
TableId | GameId | DealId | UserId | timestamp
1       | 201    | 300    | 12345  |
1       | 201    | 300    | 12567  |
Maybe what I can do is create 4 Azure tables like below, but then I am doing a lot of duplication; also, I would not be able to fire a point query as mentioned at https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#guidelines-for-table-design
GameLogsByTableId -- this will have TableId as PartitionKey and GUID as RowKey
GameLogsByGameId -- this will have GameId as PartitionKey and GUID as RowKey
GameLogsByUserId -- this will have UserId as PartitionKey and GUID as RowKey
GameLogsByDealId -- this will have DealId as PartitionKey and GUID as RowKey
Thoughts please?
TableId, GameId, DealId, and UserId are all of type long.
I would like to query data such that
Get me all the logs from a TableId.
Get me all the logs from a TableId and in a particular game(GameId)
Get me all the logs of a user(userid) in this game(GameId)
Get me all the logs of a user in a deal(dealId)
Get me all the logs from a table on a date; similarly for a user, game, and deal
Based on my knowledge so far of Azure Tables, I believe you're on the right track.
However there are certain things I would like to mention:
You could use a single table for storing all data
You don't really need to use separate tables for storing each kind of data though this approach logically separates the data nicely. If you want, you could possibly store them in a single table. If you go with single table, since these ids (Game, Table, User, and Deal) are numbers what I would recommend is to prefix the value appropriately so that you can nicely identify them. For example, when specifying PartitionKey denoting a Game Id, you can prefix the value with G| so that you know it's the Game Id e.g. G|101.
Pre-pad your Id values with 0 to make them equal-length strings
You mentioned that your id values are long. However the PartitionKey value is of string type. I would recommend prepadding the values so that they are of equal length. For example, when storing Game Id as PartitionKey instead of storing them as 1, 2, 103 etc. store them as 00000000001, 00000000002, 00000000103. This way when you list all Ids, they will be sorted in proper order. Without prepadding, you will get the results as 1, 10, 11, 12....19, 20.
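For example, a small sketch of building such key values in Java; the width of 11 and the G| prefix are illustrative:

long gameId = 103;
// Zero-pad to a fixed width so string ordering matches numeric ordering:
String padded = String.format("%011d", gameId);   // "00000000103"
// Prefix with the entity type as suggested earlier:
String partitionKey = "G|" + padded;              // "G|00000000103"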
You will lose transaction support
Since you're using multiple tables (or even a single table with different PartitionKeys), you will not be able to use the Entity Batch Transactions available in Azure Tables, and each insert needs to be done as a separate operation. Since each operation is a network call and can possibly fail, you may want to do the writes through an idempotent background process which keeps trying to insert the data into the multiple tables until it succeeds.
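A rough sketch of such an idempotent retry loop, assuming the classic Azure Storage SDK for Java (CloudTable / TableOperation); the retry limit and the entity are placeholders:

import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.table.CloudTable;
import com.microsoft.azure.storage.table.TableEntity;
import com.microsoft.azure.storage.table.TableOperation;

public class MultiTableWriter {
    private static final int RETRY_LIMIT = 5; // placeholder

    // insertOrReplace is idempotent, so a retried write is harmless.
    static void writeEverywhere(Iterable<CloudTable> tables, TableEntity entity)
            throws StorageException {
        for (CloudTable t : tables) {
            for (int attempt = 0; ; attempt++) {
                try {
                    t.execute(TableOperation.insertOrReplace(entity));
                    break; // this table succeeded; move on to the next one
                } catch (StorageException e) {
                    if (attempt >= RETRY_LIMIT - 1) throw e; // give up after the limit
                }
            }
        }
    }
}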
Instead of a Guid for the RowKey, I suggest you create a composite RowKey based on other values
This is more applicable for update scenario. Since an update requires both PartitionKey and RowKey, I would recommend using a RowKey which is created as a composition of other values. For example, if you're using TableId as PartitionKey for GameLogsByTableId, I would suggest creating a RowKey using other values e.g. U|[UserId]|D|[DealId]|G|[GameId]. This way, when you get a record to update, you automatically know how to create a RowKey instead of fetching the data first from the table.
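For instance, a one-liner sketch of the composite RowKey described above (the padding width is illustrative):

// e.g. userId=12, dealId=300, gameId=201 -> "U|00000000012|D|00000000300|G|00000000201"
String rowKey = String.format("U|%011d|D|%011d|G|%011d", userId, dealId, gameId);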
Partition Scans
I looked at your querying requirements and almost all of them would result in partition scans. To avoid that, I would suggest keeping even more duplicate copies of the data. For example, consider #3 and #4 in your querying requirements: you will need to scan the entire partition for a user to find information about a Game Id and Deal Id. So please be prepared for the scenario where the table service returns you nothing but continuation tokens.
Personally, unless you have absolutely massive data requirements, I would not use table storage for this. It will make your job much harder than using an SQL database; you can use any index you like, have relational integrity, and so much more. The only thing in favour of ATS is that it's cheap for large data.
Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, having come from RDBMS after a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move away from MySQL where the transactional/relational model doesn't make sense. What I would get is the A and P of CAP [availability and partition tolerance].
The present data model is as simple as this:
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of the application is similar to activity logging!
I would like to move this to NoSQL as per my requirements, and separate it from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking in terms of a map:
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as the key and store the rest of the data in the value.
After reading through user defined types in Cassandra, can I use a UserDefinedType as the value, which essentially gives me one key and multiple values? Otherwise, I could use it at the normal column level without a UserDefinedType. One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed to the same table, since the key varies from application to application and within an application each entity will be unique!
No application/business function will access this data without the key; in simple terms, there is no requirement to fetch data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event (
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp)  -- further clustering keys could follow here
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
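As a hedged illustration of that query pattern, here is a minimal sketch using the DataStax Java driver (3.x API assumed; the keyspace and ids are placeholders): equality on the partition key, then a range on the clustering key.

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class EventQuery {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace

        UUID partitionId = UUID.randomUUID(); // in practice, a known event id

        // Hit one partition (equality on id), then range-filter on the
        // clustering key (timestamp) within that partition.
        long now = System.currentTimeMillis();
        ResultSet rs = session.execute(
                "SELECT * FROM event WHERE id = ? AND timestamp >= ? AND timestamp <= ?",
                partitionId, UUIDs.startOf(now - 3600_000), UUIDs.endOf(now));
        for (Row row : rs) {
            System.out.println(row.getString("some_column"));
        }

        cluster.close();
    }
}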
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per record. If you do another insert with the same primary key, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
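Continuing the query sketch above (same Session and event table), the upsert behaviour looks like this:

// Two inserts with the same primary key: the second silently
// overwrites the first; nothing fails.
UUID id = UUID.randomUUID();
UUID ts = UUIDs.timeBased();

session.execute("INSERT INTO event (id, timestamp, some_column) VALUES (?, ?, ?)",
        id, ts, "first");
session.execute("INSERT INTO event (id, timestamp, some_column) VALUES (?, ?, ?)",
        id, ts, "second"); // same (id, ts): the row is overwritten, no error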
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time series log data, Cassandra is a very, very good choice.
I have a structure where I want a user to be able to see other users' feeds.
One way of doing it is to fan out an action to all interested parties's feed.
That would result in a query like select from feeds where userid=
Otherwise, I could avoid writing so much data, and since I am already doing a read I could do:
select from feeds where userid IN (list of friends).
Is the second one slower? I don't have the application yet to test this with a lot of data/clustering. As the application is big, writing code to test a single node is not worth it, so I ask for your knowledge.
If your title is correct, and userid is a secondary index, then running a SELECT/WHERE/IN is not even possible. The WHERE/IN clause only works with primary key values. When you use it on a column with a secondary index, you will see something like this:
Bad Request: IN predicates on non-primary-key columns (columnName) is not yet supported
Also, the DataStax CQL3 documentation for SELECT has a section worth reading about using IN:
When not to use IN
The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
As for your first query, it's hard to speculate about performance without knowing about the cardinality of userid in the feeds table. If userid is unique or has a very high number of possible values, then that query will not perform well. On the other hand, if each userid can have several "feeds," then it might do ok.
Remember, Cassandra data modeling is about building your data structures for the expected queries. Sometimes, if you have 3 different queries for the same data, the best plan may be to store that same, redundant data in 3 different tables. And that's ok to do.
I would tackle this problem by writing a table geared toward that specific query. Based on what you have mentioned, I would build it like this:
CREATE TABLE feedsByUserId (
    userid UUID,
    feedid UUID,
    action text,
    PRIMARY KEY (userid, feedid));
With a composite primary key made up of userid as the partition key, you will then be able to run your SELECT/WHERE/IN query mentioned above and achieve the expected results. Of course, I am assuming that the addition of feedid will make the entire key unique. If that is not the case, then you may need to add an additional field to the PRIMARY KEY. My example also assumes that userid and feedid are version-4 UUIDs. If that is not the case, adjust their types accordingly.
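As a hedged illustration, running that IN query against the table above via the DataStax Java driver (3.x API assumed) could look like this; the keyspace and friend ids are placeholders:

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class FeedQuery {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // placeholder keyspace

        UUID friend1 = UUID.randomUUID(); // placeholders for real friend ids
        UUID friend2 = UUID.randomUUID();

        // IN is valid here because userid is the partition key of feedsByUserId.
        ResultSet rs = session.execute(
                "SELECT feedid, action FROM feedsByUserId WHERE userid IN (?, ?)",
                friend1, friend2);
        for (Row row : rs) {
            System.out.println(row.getUUID("feedid") + " " + row.getString("action"));
        }

        cluster.close();
    }
}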