I am new to DynamoDB and I see that data sorting is done by the sort key value. Is there a way to sort data by any attribute within an item, irrespective of the sort key? I am using Node.js with DynamoDB for my project.
May I know how I can achieve this?
Thank you
So long as you don't need to re-partition your data, the first way to do this is to create an LSI (local secondary index) with your new sort key.
But unlike SQL, you can't do ad hoc queries. When you design your table you already have to know a lot about the queries and transactions you will run, so in general you shouldn't need to make an LSI every time you want to search. I recommend watching this.
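For example, with the AWS SDK for JavaScript v3, querying through an LSI is just a normal Query call with an IndexName. A minimal sketch (the table name "Orders", index name "byCreatedAt", and attribute names are made-up placeholders):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query one partition, sorted by the LSI's alternate sort key instead of the table's sort key.
const result = await ddb.send(new QueryCommand({
  TableName: "Orders",          // placeholder table name
  IndexName: "byCreatedAt",     // placeholder LSI name
  KeyConditionExpression: "customerId = :c",
  ExpressionAttributeValues: { ":c": "customer-123" },
  ScanIndexForward: false,      // newest first
}));
console.log(result.Items);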
I have a requirement but am not able to figure out how to solve it. I have datasets in the below format:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or, if I put it in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparisons, aggregations and queries over multiple rows, like:
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest one
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elastic; if you have any ideas, please give a brief explanation. Or any other tool like Spark, Hadoop, Cassandra, any pointers?
I am trying out things but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What are your query types? Does your data qualify as big data? Is the data updatable? Do we expect late events? Do we need historical data in the future, or can we use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems by using window functions. For example, you can store your data in any of the tools you mentioned and, using the Presto SQL engine, query for your desired result. But not all of them are optimal, and usually these kinds of problems cannot be solved with a single tool; a set of tools can cover all requirements.
tl;dr: the text below does not reach a final solution; it introduces a way to think about data modeling and choosing tools.
Let me try to model the problem so we can choose a single tool. I assume your data is not updatable, you need a low-latency response time, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. The first, second, and last requirements all rely on ID as the identifying part, and it seems there is no need for joins or for global ordering on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the clustering part; for each ID, events are ordered by time.
The third requirement is a bit vague: do you want the result over all data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
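To make that concrete, the "time difference between the last 2 rows where id=123" condition is a classic LAG() window query. A rough sketch, using the node-postgres client just as an example engine (Presto and Postgres both support LAG(); the events table and its id/atime columns mirror the question and are assumptions):

import { Client } from "pg";

const pg = new Client();     // connection settings come from the PG* environment variables
await pg.connect();

// LAG() exposes the previous atime per id; the newest row then carries
// the difference between the last two events for that id.
const { rows } = await pg.query(`
  SELECT atime - LAG(atime) OVER (PARTITION BY id ORDER BY atime) AS diff
  FROM events
  WHERE id = 123
  ORDER BY atime DESC
  LIMIT 1
`);
console.log(rows[0]?.diff);

await pg.end();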
Based on the notes above, it seems we should choose a tool with good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and Elasticsearch are the ones that come to mind. Let's check them:
Cassandra: It's great on response time for random access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. You also have to design your data model very carefully, and because we will need the raw data later, it doesn't seem like a good choice here. We could bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries over indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we get data locality for computations). But there is a problem: ID is not unique, so we cannot choose ID as the primary key and we face some problems with random access. (We could make ID plus atime, as a timestamp column, a compound primary key, but that does not fully save us.)
Druid: It's a great OLAP tool. Given the storage model (segment files) that Druid uses, with the right data model you can run analytic queries over a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random access queries and sharding. We can also get some window-function-like behaviour from its aggregation framework by defining pipelines for doing aggregations. It supports capped collections, which we could use to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements (there is a small sketch after this comparison).
Elasticsearch: It's great on random access, maybe the greatest. With some kinds of filter aggregations we can get a form of window function, and it can handle a large amount of data with sharding. But its query language is hard: I can imagine answering the first and second questions with ES, but for now I can't compose the query in my head. It takes time to find the right solution with it.
So it seems MongoDB and Elasticsearch can answer our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
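Since MongoDB looks like the closest fit, here is a minimal sketch with the official MongoDB Node.js driver (the connection string, database, and collection names are placeholders, and atime is assumed to be stored as a Date): the last 10 events for one id come from a simple pipeline, and the time differences can then be computed application-side.

import { MongoClient } from "mongodb";

const mongo = new MongoClient("mongodb://localhost:27017");   // placeholder URI
await mongo.connect();
const events = mongo.db("metrics").collection("events");      // placeholder names

// Last 10 events for id 123, newest first.
const last10 = await events
  .aggregate([
    { $match: { id: 123 } },
    { $sort: { atime: -1 } },
    { $limit: 10 },
  ])
  .toArray();

// Time difference between the last two rows for that id (atime assumed to be a Date).
if (last10.length >= 2) {
  console.log(last10[0].atime.getTime() - last10[1].atime.getTime());
}

await mongo.close();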
As the title says, I'm trying to query for all the rows that have no value stored in a particular column. I've been searching for a while, and the only operator I've found is CONTAINS, which doesn't fit my need.
Consider the following table:
CREATE TABLE environment(
    id uuid,
    name varchar,
    message text,
    public boolean,
    participants set<varchar>,
    PRIMARY KEY (id)
);
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because there's no way it can be done without doing a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways that table will be queried, and building it to support those queries.
The other issue that you'll have to deal with, is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice...it's much easier to query for data that exists, rather than data that does not exist. Although, when writing with lightweight transactions, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT each one, and check for null on the app side. That approach will be slow, although it won't stress the cluster. Probably the better approach is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but this is probably the best way to handle a situation like this.
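A rough sketch of that app-side iteration with the Node.js cassandra-driver (the contact point, keyspace, and the source of knownIds are assumptions; it presumes you can enumerate the ids from somewhere else):

import { Client } from "cassandra-driver";

const cass = new Client({
  contactPoints: ["127.0.0.1"],     // placeholder contact point
  localDataCenter: "datacenter1",
  keyspace: "myks",                 // placeholder keyspace
});

const knownIds: string[] = [];      // ids obtained from some other source (another table, a file, ...)
const empty: string[] = [];

for (const id of knownIds) {
  const rs = await cass.execute(
    "SELECT id, participants FROM environment WHERE id = ?",
    [id],
    { prepare: true }
  );
  const row = rs.first();
  // Cassandra stores an empty set as null, so the driver returns null for it.
  if (row && !row["participants"]) empty.push(id);
}
console.log(empty);

await cass.shutdown();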
I am considering choosing between DynamoDB and AWS Keyspaces.
My main issue is still with many-to-many relationships in Dynamo. You don't really have nice options. Either you build an adjacency list, which works for immutable data, but in most scenarios the data is going to change. Another way is making 2 DB calls, which is really not that great. A third option would be to keep updating the duplicated data all the time, which also seems like a big pain. Also, batch writes are limited to 25 items, I think.
Cassandra, however, provides materialized views, where at least I don't have to manage the replication on my own, and I can make 1 DB call to get everything I need.
I am still relatively new to NoSQL databases so I might be missing a lot of stuff.
Are there plans for Dynamo to add Materialized Views or is there better way to do it?
In my eyes it seems like a really good feature. It wouldn't even have to create new tables; references between item attributes that keep themselves updated would be enough.
DynamoDB has a feature called Global Secondary Index (GSI) which is very close to the materialized view feature of Cassandra. Despite its confusing name, DynamoDB's GSI is not just an index like what Cassandra calls a "secondary index"! It doesn't just keep the keys matching a particular column value: beyond the keys, it can also keep any other item attributes which you choose to project. Exactly like a materialized view.
DynamoDB also has a more efficient Local Secondary Index which you can consider if the view's partition key is the same as the base table's - and you just want to sort items differently or project only part of the attributes.
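To make the comparison concrete, here is a minimal sketch of defining a GSI with a projection using the AWS SDK for JavaScript v3 (the table, index, and attribute names are made-up placeholders):

import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

await ddb.send(new CreateTableCommand({
  TableName: "Orders",                 // placeholder
  BillingMode: "PAY_PER_REQUEST",
  AttributeDefinitions: [
    { AttributeName: "customerId", AttributeType: "S" },
    { AttributeName: "orderId", AttributeType: "S" },
    { AttributeName: "status", AttributeType: "S" },
  ],
  KeySchema: [
    { AttributeName: "customerId", KeyType: "HASH" },
    { AttributeName: "orderId", KeyType: "RANGE" },
  ],
  GlobalSecondaryIndexes: [{
    IndexName: "byStatus",             // placeholder
    KeySchema: [
      { AttributeName: "status", KeyType: "HASH" },
      { AttributeName: "orderId", KeyType: "RANGE" },
    ],
    // Like a materialized view: project extra attributes into the index.
    Projection: { ProjectionType: "INCLUDE", NonKeyAttributes: ["total", "createdAt"] },
  }],
}));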
I'm new to Cassandra, so I read a dozen articles about it and thus I know the basics. All the tutorials show efficient data retrieval by 1 or 2 columns and a time range. What I could not find was how to correctly model your data if you have more conditions.
I have a big, normalised events database with quite a few columns, say:
event_type
time
email
user_age
user_country
user_language
and so on.
I would need to be able to query by all columns. So in RDBMS I would query:
SELECT email FROM table WHERE time > X AND user_age BETWEEN X AND X AND user_language = 'nl' etc..
I know I can make a separate table for each column, but then I would still need to combine the results. Maybe this is not a bad approach, but I doubt it since there are no subqueries.
My question is obviously, how can I model this kind of data correctly in Cassandra?
Thanks a lot!
I would need to be able to query by all columns.
Let me stop you right there. In Cassandra, you create your tables based on your anticipated query patterns, and usually a table supports a single query. In your case, you have "quite a few" columns and you will need to duplicate that data into a table designed to support each possible query. That is going to get big and ungainly, very quickly.
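For illustration only, here is what one such query-specific table might look like, using the Node.js cassandra-driver (the keyspace and table names are made up; the columns follow your list). It supports exactly one query shape, a country plus a time range, and every other query pattern would need its own table:

import { Client } from "cassandra-driver";

const cass = new Client({
  contactPoints: ["127.0.0.1"],       // placeholder contact point
  localDataCenter: "datacenter1",
  keyspace: "events_ks",              // placeholder keyspace
});

// One table per query pattern: partition by country, cluster by time.
await cass.execute(`
  CREATE TABLE IF NOT EXISTS events_by_country (
    user_country text,
    time timestamp,
    email text,
    event_type text,
    user_age int,
    user_language text,
    PRIMARY KEY ((user_country), time)
  )`);

const rs = await cass.execute(
  "SELECT email FROM events_by_country WHERE user_country = ? AND time > ?",
  ["NL", new Date("2023-01-01")],     // placeholder values
  { prepare: true }
);
console.log(rs.rows.length);

await cass.shutdown();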
Could we just add the rest as secondary indexes? There could potentially still be millions of rows in the event-type table for a given merchant_id + time selection.
Secondary indexes are intended to be used on middle-of-the-road cardinality columns, so both extremely low and extremely high cardinality columns are bad candidates. The problem is that Cassandra will have to pick one of your nodes as a coordinator, scan the index on each node (incurring lots of network time), and then build and return the result set. It's a prescription for poor performance that flies in the face of the best practices for working with a distributed database.
In short, Cassandra is not a good solution for use cases like this. It sounds like you want to be able to do OLAP-type queries, and for that you should use a tool that is better-suited for that purpose.
I'm new to DynamoDB. I have a table with more than 100k items in it, and this table gets refreshed frequently. On this table, I want to be able to do something similar to what I would do in the relational database world: how can I get the maximum value of an attribute from the table?
DynamoDB is a NoSQL database and therefore is very limited on how you can query data. It is not possible to perform aggregations such as max value from a table by directly calling the DynamoDB API. You will have to look to different tools and approaches to solve this problem.
There are a number of possible solutions you can consider:
Perform A Table Scan
With more than 100k items in your table this is likely a very bad idea. A table scan will read through every single item and you can have application side logic identify the maximum value. This really isn't a workable solution.
Materialized Index in DynamoDB
Depending on your use case you can use DynamoDB Streams and a Lambda function to maintain an index in a separate DynamoDB table. If your table is insert-only (no updates and no deletions), you could store the maximum in a separate table, and as new records get inserted you can compare them and perform the necessary updates.
This approach is workable under some constrained circumstances, but is not a generalized solution.
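As a rough sketch of that stream-driven approach (a TypeScript Lambda handler; the numeric attribute "value", the summary table "TableMax", and the stream wiring are all assumptions):

import { DynamoDBStreamEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    if (record.eventName !== "INSERT") continue;
    const value = Number(record.dynamodb?.NewImage?.value?.N);   // assumed numeric attribute
    if (Number.isNaN(value)) continue;

    try {
      // Only overwrite the stored maximum when the new value is larger.
      await ddb.send(new UpdateCommand({
        TableName: "TableMax",                                   // placeholder summary table
        Key: { id: "global" },
        UpdateExpression: "SET maxValue = :v",
        ConditionExpression: "attribute_not_exists(maxValue) OR maxValue < :v",
        ExpressionAttributeValues: { ":v": value },
      }));
    } catch (err: any) {
      // A failed condition just means the new value wasn't a new maximum.
      if (err.name !== "ConditionalCheckFailedException") throw err;
    }
  }
};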
Perform Analytics using Amazon Redshift
DynamoDB is not meant to do analytical operations such as maximum, while Redshift is a very powerful big data platform that can perform these types of calculations with ease. Similar to the DynamoDB index, you can use DynamoDB streams to send data into Redshift as records get inserted to maintain a near real time copy of the table for analytical purposes.
If you are looking for more of an offline or analytical solution this is a good choice.
Perform Analytics using Elasticsearch
While DynamoDB is a powerful NoSQL solution with strong guarantees on data durability, Elasticsearch provides a very flexible querying method that allows for queries such as maximum, and these aggregations can be sliced and diced on any attribute value in real time. Similar to the above solutions, you can use DynamoDB Streams to send record inserts, updates, and deletions into the Elasticsearch index in real time.
If you want to stick with DynamoDB but need some additional querying capability, this is really a good option especially when using the AWS ES service which will fully manage an Elasticsearch cluster for you. It is important to remember that Elasticsearch doesn't replace your DynamoDB table, it is just an easily searchable index of the same data.
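For instance, with the Elasticsearch JavaScript client a max aggregation over the synced index is a single call. A minimal sketch (the index name "items", the field "value", and the endpoint are placeholders; the request shape assumes the v8 client):

import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });   // placeholder endpoint

const result = await es.search({
  index: "items",        // the index kept in sync from DynamoDB Streams
  size: 0,               // we only want the aggregation, not the documents
  aggs: {
    max_value: { max: { field: "value" } },
  },
});

// The aggregations object is loosely typed, so cast before reading the value.
console.log((result.aggregations as any)?.max_value?.value);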
Just use a SQL Database
The obvious solution is that if you have SQL requirements, you move from a NoSQL-based system to a SQL-based one. AWS's RDS offering provides a managed solution. While DynamoDB provides a lot of benefits, if your use case is pulling you towards a SQL solution, the easiest thing to do may be to not fight it and just change solutions.
This is not to say that a SQL-based or NoSQL-based solution is better; there are pros and cons to each, and those vary based on the specific use case, but it is definitely an option to consider.
You can actually get a MAX aggregate over a DynamoDB table by querying it with Hive on Amazon EMR: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.Querying.html