I am new to DynamoDB. I have a table in DynamoDB with more than 100k items in it, and the table gets refreshed frequently. On this table, I want to be able to do something similar to what I would do in the relational database world: how can I get the max value of an attribute from the table?
DynamoDB is a NoSQL database and is therefore very limited in how you can query data. It is not possible to perform aggregations such as the max value of an attribute by calling the DynamoDB API directly. You will have to look at different tools and approaches to solve this problem.
There are a number of possible solutions you can consider:
Perform A Table Scan
With more than 100k items in your table this is likely a very bad idea. A table scan reads through every single item, and your application-side logic would identify the maximum value. This really isn't a workable solution.
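For illustration only, here is a minimal boto3 sketch of that scan-and-compare approach; the table name and the numeric attribute "score" are placeholders, not from the question:

```python
import boto3

# Placeholder table; "score" is a hypothetical numeric attribute to aggregate.
table = boto3.resource("dynamodb").Table("my-table")

max_score = None
scan_kwargs = {
    "ProjectionExpression": "#v",
    "ExpressionAttributeNames": {"#v": "score"},
}
while True:
    # Each Scan page consumes read capacity for every item it touches.
    response = table.scan(**scan_kwargs)
    for item in response.get("Items", []):
        value = item.get("score")
        if value is not None and (max_score is None or value > max_score):
            max_score = value
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

print("max score:", max_score)
```

On a table with 100k+ items that refreshes frequently, this burns read capacity on every run and the result is stale as soon as it finishes.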
Materialized Index in DynamoDB
Depending on your use case, you can use DynamoDB Streams and a Lambda function to maintain an index in a separate DynamoDB table. If your table is write-only, with no updates and no deletions, you could store the maximum in a separate table, and as new records get inserted you can compare them against the stored maximum and perform the necessary updates.
This approach is workable under some constrained circumstances, but is not a generalized solution.
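As a rough sketch of that idea (the "stats" table, its key, and the "score" attribute are assumptions, and it presumes the stream is configured to include new images), a Lambda function attached to the stream could keep a single-item table up to date with a conditional write:

```python
from decimal import Decimal

import boto3

# Hypothetical single-item table holding the running maximum.
stats_table = boto3.resource("dynamodb").Table("stats")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # assumes an insert-only source table, as described above
        new_image = record["dynamodb"]["NewImage"]  # low-level attribute format
        score = Decimal(new_image["score"]["N"])
        try:
            # Only write when the new value beats the stored maximum.
            stats_table.update_item(
                Key={"stat": "max_score"},
                UpdateExpression="SET #v = :s",
                ExpressionAttributeNames={"#v": "val"},
                ExpressionAttributeValues={":s": score},
                ConditionExpression="attribute_not_exists(#v) OR #v < :s",
            )
        except stats_table.meta.client.exceptions.ConditionalCheckFailedException:
            pass  # the existing maximum is already larger
```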
Perform Analytics using Amazon Redshift
DynamoDB is not meant for analytical operations such as maximum, while Redshift is a very powerful big-data platform that can perform these types of calculations with ease. Similar to the DynamoDB index approach, you can use DynamoDB Streams to send data into Redshift as records get inserted, maintaining a near real-time copy of the table for analytical purposes.
If you are looking for more of an offline or analytical solution this is a good choice.
Perform Analytics using Elasticsearch
While DynamoDB is a powerful NoSQL solution with strong guarantees on data durability, Elasticsearch provides a very flexible query model that allows aggregations such as maximum, and these aggregations can be sliced and diced on any attribute value in near real time. Similar to the above solutions, you can use DynamoDB Streams to send record inserts, updates and deletions into the Elasticsearch index in real time.
If you want to stick with DynamoDB but need some additional querying capability, this is a really good option, especially when using the AWS Elasticsearch Service, which will fully manage an Elasticsearch cluster for you. It is important to remember that Elasticsearch doesn't replace your DynamoDB table; it is just an easily searchable index of the same data.
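For example, once the items are indexed, the maximum is a one-line aggregation. This sketch uses the elasticsearch-py client with an assumed index name and field; the body-style call is used for broad client-version compatibility:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # or your AWS ES endpoint

# "my-table" and "score" mirror the hypothetical DynamoDB table and attribute.
response = es.search(
    index="my-table",
    body={
        "size": 0,  # no documents needed, only the aggregation result
        "aggs": {"max_score": {"max": {"field": "score"}}},
    },
)
print(response["aggregations"]["max_score"]["value"])
```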
Just use a SQL Database
The obvious solution, if you have SQL requirements, is to move from a NoSQL-based system to a SQL-based one. AWS's RDS offering provides a managed solution. While DynamoDB provides a lot of benefits, if your use case is pulling you towards a SQL solution, the easiest thing to do may be to not fight it and just change solutions.
This is not to say that either a SQL-based or a NoSQL-based solution is better; there are pros and cons to each, and those vary with the specific use case, but it is definitely an option to consider.
You can also get a MAX aggregate over a DynamoDB table by querying it with Hive on Amazon EMR: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.Querying.html
I have a requirement that I am not able to figure out how to solve. I have datasets in the below format:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, such as:
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123 and grade=A
Time difference between the first, 3rd, 5th and the latest one
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation on the data. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elasticsearch; if you have any ideas, please give a brief outline. Or any other tools like Spark, Hadoop or Cassandra, any pointers?
I am trying things out, but any help is appreciated.
Choosing the right technology depends heavily on your SLA: how much latency can your queries tolerate? What are your query types? Does your data count as big data or not? Is the data updatable? Do you expect late events? Do you need the historical raw data in the future, or can you use techniques like rollup? And so on. To clarify my answer: you can probably solve your problems with window functions. For example, you can store your data in any of the tools you mentioned and, by using the Presto SQL engine, query it to get your desired result. But not all of them are optimal, and usually these kinds of problems cannot be solved with a single tool; a set of tools can cover all the requirements.
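To make the window-function idea concrete, here is a small sketch of the first two computations from the question; pandas is used purely as a stand-in for whichever engine you end up choosing, and the timestamps are invented:

```python
import pandas as pd

# Sample rows shaped like the question's (id, atime, grade) data; times made up.
df = pd.DataFrame(
    [
        {"id": 123, "atime": "2021-01-01 10:00:00", "grade": "A"},
        {"id": 241, "atime": "2021-01-01 10:05:00", "grade": "B"},
        {"id": 123, "atime": "2021-01-01 10:20:00", "grade": "C"},
        {"id": 123, "atime": "2021-01-01 10:45:00", "grade": "A"},
    ]
)
df["atime"] = pd.to_datetime(df["atime"])

# Time difference between the last two rows where id == 123.
last_two = df[df["id"] == 123].sort_values("atime").tail(2)
print(last_two["atime"].diff().iloc[-1])

# Time difference between the last two rows where id == 123 and grade == "A".
last_two_a = df[(df["id"] == 123) & (df["grade"] == "A")].sort_values("atime").tail(2)
print(last_two_a["atime"].diff().iloc[-1])
```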
tl;dr: the text below does not arrive at a single definitive solution. It introduces a way to think about data modeling and choosing tools.
Let me try to model the problem in order to choose a single tool. I assume your data is not updatable, you need low-latency response times, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second and last requirements all rely on ID as the identifier, and there seems to be no need for joins or for global ordering on other fields like time. So we can choose ID as the partition key (physical or logical) and atime as the clustering part; for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want the result over all the data, or per ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the notes above, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB and Elasticsearch are the ones that currently come to mind. Let's check them:
Cassandra: It has great response times on random-access queries, can handle a huge amount of data easily, and has no single point of failure. But sadly it does not support window functions. You also have to design your data model carefully up front, and it doesn't seem like a good choice here (because of the future need for the raw data). We could bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding another tool to the stack.
Postgres: It's great at random-access queries on indexed columns, and it supports window functions. We can shard data (horizontal partitioning) across multiple servers, and by choosing ID as the shard key we get data locality for computations. But there is a problem: ID is not unique, so we cannot use ID alone as the primary key, and we run into some problems with random access. (We could use ID and atime, as a timestamp column, as a compound primary key, but that doesn't save us.)
Druid: It's a great OLAP tool. Because of the way Druid stores data (segment files), with the right data model you can run analytic queries over a huge volume of data in sub-second time. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. However, by using rollup we lose the raw data, and we need it.
MongoDB: It supports random-access queries and sharding. We can also get a form of window function through its aggregation framework by defining pipelines of stages. It supports capped collections, which we could use to store the last 10 events for each ID if the cardinality of the ID column is not too high. It seems this tool can cover all of our requirements.
Elasticsearch: It's great at random access, maybe the greatest. With certain filter aggregations we can get a type of window function. It can handle a large amount of data with sharding. But its query language is hard; I can imagine answering the first and second questions with ES, but right now I can't construct the query in my head. It would take time to find the right solution with it.
So it seems MongoDB and Elasticsearch can answer our requirements, but there are a lot of 'if's along the way. I don't think we can find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
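As a sketch of the MongoDB route (database, collection and field names are assumptions), the "last two events for an ID" question comes down to a sort and a limit, with the difference computed client-side:

```python
from pymongo import DESCENDING, MongoClient

events = MongoClient("mongodb://localhost:27017")["mydb"]["events"]

# Fetch the two most recent events for a given id, newest first.
last_two = list(events.find({"id": 123}).sort("atime", DESCENDING).limit(2))

if len(last_two) == 2:
    # Assumes atime is stored as a BSON datetime.
    delta = last_two[0]["atime"] - last_two[1]["atime"]
    print("time difference between the last two events:", delta)

# Last 10 records for a particular id, oldest to newest.
recent = list(events.find({"id": 123}).sort("atime", DESCENDING).limit(10))[::-1]
```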
This is the context of my situation:
I have a huge table in DynamoDB with 250,000 items (example table).
I want to be able to "substring search" through 3 attributes, getting the list of all items that match the substrings.
The attributes I want to be able to search on can have the same value among different items.
My hash key is an id (the only attribute that really differentiates the items).
I'm using React Native as the client.
My schema has these "query types" (queries).
Where I am:
I first tried the listCaballos query, adding the user input as a filter to the query and using nextToken recursively to go over the whole table (without secondary indexes), but it took 6 minutes to go through the table and return the items.
I know secondary indexes help to partition and then order the items by chosen keys (which makes queries fast), but I read that this forces the user to make an exact-match search (not a substring kind of search), and that's not what I need.
I've heard Elastic Search might help.
Any suggestions?
Thanks!
This is not efficient in DynamoDB. Although you can create secondary indexes and search with 'begins_with', substring ('contains') matching is only available in filter expressions, which are not efficient on a large data set (since DynamoDB consumes read capacity to read all the items and only then applies the filter).
For this kind of requirement, it is more efficient to index the data with another service like AWS Elasticsearch or CloudSearch, so that you can run the query on top of that service and configure continuous indexing.
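To illustrate the difference (the index, key and attribute names below are hypothetical): a prefix match can be expressed as a key condition, but a substring match only exists as a filter that runs after the items are already read:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("Caballos")  # placeholder table name

# Efficient: prefix search via a key condition on a (hypothetical) GSI sort key.
table.query(
    IndexName="tipo-nombre-index",
    KeyConditionExpression=Key("tipo").eq("caballo") & Key("nombre").begins_with("Luc"),
)

# Inefficient: a substring match is only available as a filter expression,
# so DynamoDB still reads (and bills for) every item it scans.
table.scan(FilterExpression=Attr("nombre").contains("uce"))
```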
Getting Started
Searching DynamoDB Data with Amazon CloudSearch
Combining DynamoDB and Amazon Elasticsearch with Lambda
Indexing Amazon DynamoDB Content with Amazon Elasticsearch Service Using AWS Lambda
You will not be able to use secondary indexes to help create a (reasonable) generalized substring search.
There are many ways to solve your problem. Here, I present a few of them, and this is by no means exhaustive.
DynamoDB -> CloudSearch
CloudSearch can provide general search functionality for your data. Basically, you can connect a lambda function to the DynamoDB stream from your table. That lambda function can keep your CloudSearch domain up to date. Here is an overview of this process.
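A rough sketch of what that Lambda function could look like (the domain endpoint, key name and field handling are assumptions; real code would need batching and error handling):

```python
import json

import boto3

# The cloudsearchdomain client targets a specific search domain's doc endpoint.
cs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-my-domain.us-east-1.cloudsearch.amazonaws.com",
)

def handler(event, context):
    batch = []
    for record in event["Records"]:
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # assumes a string key "id"
        if record["eventName"] == "REMOVE":
            batch.append({"type": "delete", "id": doc_id})
        else:
            image = record["dynamodb"]["NewImage"]
            # Flatten the low-level attribute format into plain field values.
            fields = {name: list(attr.values())[0] for name, attr in image.items()}
            batch.append({"type": "add", "id": doc_id, "fields": fields})
    cs.upload_documents(
        documents=json.dumps(batch),
        contentType="application/json",
    )
```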
CloudSearch
You could forgo DynamoDB and store this data in CloudSearch. That eliminates the need for the lambda function and means your data is only stored in one place. However, you need to tolerate a higher time to consistency because CloudSearch doesn’t have strongly consistent reads like DynamoDB.
RDS
You could just use a SQL database of some sort. Most of them support a full text search. You can even use AWS Aurora Serverless if you don’t want to manage database instances.
We are using Cassandra 3 and came up with a data model based on the initial requirements. Since the requirements have changed very frequently, this model has subsequently changed many times as well. Considering these requirement and model changes, there has been no major improvement in terms of development. The team has decided to go with the BLOB data type and store the entire data in a BLOB. Can you please share the drawbacks of using a BLOB in such a scenario? Thanks in advance.
We migrated from Astyanax Cassandra 1.1 directly to CQL Cassandra 3.0, so we still have a lot of column families which store their values as BLOBs.
Major issues we face right now are:
1) It is difficult to visualize data directly from the database: the biggest advantage of CQL is that it supports SQL-like queries, so logging into the cql terminal and getting results directly from there normally saves a lot of time. If you use a BLOB you will not be able to do any of that.
2) CQL performs better when your table has a well-defined schema instead of using a blob to store a big chunk of data together.
If you are creating a new table, I suggest using Collections for your use case. You will be able to store different types of data, and performance will also be good.
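As a small illustration with the DataStax Python driver (keyspace and table names assumed), contrast an opaque blob with typed columns plus a map collection; only the latter is readable and queryable from cqlsh:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Everything packed into a blob: opaque to CQL, nothing to inspect or filter on.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_blob (
        id uuid PRIMARY KEY,
        payload blob
    )
""")

# Typed columns plus a map collection: visible in cqlsh and still flexible
# enough to absorb new attributes without a schema change.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_typed (
        id uuid PRIMARY KEY,
        created_at timestamp,
        attributes map<text, text>
    )
""")
```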
Here are some nice slides comparing the performance of schemaless tables against tables with a schema and collections. You can skip to slide 26 if you just want the summary.
https://www.slideshare.net/DataStax/migration-from-thrift-to-cql-brij-bhushan-ravat-ericsson-cassandra-summit-2016
We are trying to build a data warehouse for our transaction system.
- We make 5,000-6,000 transactions per day, and this can go above 20,000.
- Each transaction produces a file, size > 4 MB.
We want a system which can make updates to the existing data, with consistency and availability, and has good read performance. Infrastructure is not an issue.
HBase or Cassandra or any other? Your help and guidance is highly appreciated.
Many thanks!
Most newer NoSQL platforms can do what you need in terms of performance: both HBase and Cassandra scale horizontally (as do Aerospike and others), so performance can be guaranteed as long as the data model respects the product's patterns for data distribution.
I would not choose the technology based on performance alone.
What I would do is:
1. make a list of the different features offered by a bunch of products and then consider the one that, out of the box, best fits my needs
2. make a list of the operations I need to perform on the data and check that I am not going "against" some specific product
While 1 is easily done, 2 needs a deep product analysis. For instance, you say you need to update existing data: imagine you choose Cassandra and very frequently update a column on which you have put a secondary index (which, under the hood, creates a lookup table) for search purposes. Every time you update that column, a deletion and an insertion are performed on the lookup table. You can read in this article that performing many deletes in Cassandra is considered an anti-pattern and can lead to problematic situations. This is just an example on Cassandra, because it is the NoSQL product I know best, not a suggestion to avoid Cassandra.
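For concreteness, this is the pattern being warned about, sketched with the DataStax Python driver (table and column names are made up): every update to the indexed column implies a delete plus an insert in the hidden index table.

```python
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

session.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        id uuid PRIMARY KEY,
        status text
    )
""")
# The secondary index maintains a hidden lookup table keyed on "status".
session.execute("CREATE INDEX IF NOT EXISTS ON transactions (status)")

tx_id = uuid.uuid4()
session.execute("INSERT INTO transactions (id, status) VALUES (%s, %s)", (tx_id, "NEW"))

# Each status change rewrites the index entry (a delete followed by an insert),
# so very frequent updates here accumulate tombstones in the index.
session.execute("UPDATE transactions SET status = %s WHERE id = %s", ("PROCESSED", tx_id))
```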
I am very new to this world of document DBs.
So... why are these DBs better than an RDBMS (like MySQL or PostgreSQL) for very large amounts of data?
Document databases implement good indexing for this type of data; that is what they are designed for. A document database is a better fit because that is its purpose. A normal relational database is not designed for storing "documents": you would have to work hard to search over your document data, because each document can be in a different format, and that is a lot of work. If you choose a document DB, all of this comes built in, because the database exists only for documents and already implements the functions needed for them.
You want to distribute your data over multiple machines when you have a lot of data. That means joins become really slow, because joining data that lives on different machines requires a lot of data communication between those machines.
You can store data in a MongoDB/CouchDB document in a hierarchical way, so there is less need for joins.
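For example, with pymongo (names invented), data that would span several joined tables in a relational model can live in one hierarchical document and be read back in a single query:

```python
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

# One document holds what would otherwise be rows in orders, order_items,
# and customers tables joined together.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "country": "NL"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
})

# A single read returns the whole aggregate; no join, no cross-machine traffic.
print(orders.find_one({"order_id": 1001}))
```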
But it depends on your use case(s). I think relational databases do a better job when it comes to reporting.
MongoDB and CouchDB don't support transactions. Do you or your customers need transactions?
What do you want to do? Analyze a lot of data (business intelligence/reporting), or handle a lot of small modifications per second, i.e. "HVSP (High Volume Simple Processing)"?