This is the context of my situation:
I have a huge DB in DynamoDB with 250,000 items (example table).
I want to be able to "substring search" through 3 attributes, getting the list of all items that match the substrings.
The attributes I want to search on can have the same value across different items.
My hash key is an id (the only attribute that really differentiates the items).
I'm using react native as a client
My schema defines several query types, including the listCaballos query I use below.
Where I am:
I first tried the listCaballos query, adding the user input as a filter and following nextToken recursively to page through the whole table (without using secondary indexes), but it took 6 minutes to scan the table and return the matching items.
I know secondary indexes help to partition and then order the items by chosen keys (which makes queries fast), but I read that they force the user to make an exact search (not a substring kind of search), and that's not what I need.
I've heard Elasticsearch might help.
Any suggestions?
Thanks!
This is not efficient in DynamoDB. Although you can create secondary indexes to search with begins_with, substring ('contains') matching is only available in filter expressions, which are not efficient on a large data set: DynamoDB consumes read capacity scanning every candidate item and only then applies the filter.
For this kind of requirement, it is more efficient to index the data with another service such as AWS Elasticsearch Service or CloudSearch, so that you can run the search against that service and configure continuous indexing.
Getting Started
Searching DynamoDB Data with Amazon CloudSearch
Combining DynamoDB and Amazon Elasticsearch with Lambda
Indexing Amazon DynamoDB Content with Amazon Elasticsearch Service Using AWS Lambda
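The linked guides all follow the same basic shape: a Lambda function subscribed to the table's DynamoDB stream writes each change into the search index. A minimal sketch of such a function (the index name, the "nombre" attribute, and the unsigned HTTP call are all illustrative; a real deployment against Amazon ES would sign its requests or sit behind an access policy):

const https = require('https');

exports.handler = async (event) => {
  for (const record of event.Records) {
    const id = record.dynamodb.Keys.id.S; // the table's hash key
    if (record.eventName === 'REMOVE') {
      await esRequest('DELETE', `/caballos/_doc/${id}`);
    } else {
      // NewImage is in DynamoDB's attribute-value format; a real
      // function would unmarshall every attribute properly.
      const image = record.dynamodb.NewImage;
      await esRequest('PUT', `/caballos/_doc/${id}`, {
        nombre: image.nombre && image.nombre.S, // hypothetical attribute
      });
    }
  }
};

function esRequest(method, path, body) {
  return new Promise((resolve, reject) => {
    const req = https.request(
      { host: process.env.ES_ENDPOINT, method, path,
        headers: { 'Content-Type': 'application/json' } },
      (res) => { res.resume(); res.on('end', resolve); }
    );
    req.on('error', reject);
    if (body) req.write(JSON.stringify(body));
    req.end();
  });
}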
You will not be able to use secondary indexes to help create a (reasonable) generalized substring search.
There are many ways to solve your problem. Here, I present a few of them, and this is by no means exhaustive.
DynamoDB -> CloudSearch
CloudSearch can provide general search functionality for your data. Basically, you can connect a lambda function to the DynamoDB stream from your table. That lambda function can keep your CloudSearch domain up to date. Here is an overview of this process.
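For concreteness, here is a rough sketch of such a function using the AWS SDK's CloudSearchDomain client (the document endpoint and the "name" field are placeholders; batching and error handling are simplified):

const AWS = require('aws-sdk');
const csd = new AWS.CloudSearchDomain({
  endpoint: process.env.CLOUDSEARCH_DOC_ENDPOINT,
});

exports.handler = async (event) => {
  // Translate each stream record into a CloudSearch add/delete operation.
  const batch = event.Records.map((record) => {
    const id = record.dynamodb.Keys.id.S;
    if (record.eventName === 'REMOVE') {
      return { type: 'delete', id };
    }
    const image = record.dynamodb.NewImage;
    return {
      type: 'add',
      id,
      fields: { name: image.name ? image.name.S : '' }, // illustrative field
    };
  });
  await csd.uploadDocuments({
    contentType: 'application/json',
    documents: JSON.stringify(batch),
  }).promise();
};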
CloudSearch
You could forgo DynamoDB and store this data in CloudSearch. That eliminates the need for the lambda function and means your data is only stored in one place. However, you need to tolerate a higher time to consistency because CloudSearch doesn’t have strongly consistent reads like DynamoDB.
RDS
You could just use a SQL database of some sort. Most of them support a full text search. You can even use AWS Aurora Serverless if you don’t want to manage database instances.
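As a sketch of what that looks like in practice (PostgreSQL flavor; table and column names invented):

const { Client } = require('pg');

// Simple case-insensitive substring search. For ranked full-text
// search you would use to_tsvector/to_tsquery and a GIN index instead.
async function searchByName(term) {
  const client = new Client(); // connection settings come from the environment
  await client.connect();
  const { rows } = await client.query(
    'SELECT * FROM items WHERE name ILIKE $1',
    ['%' + term + '%']
  );
  await client.end();
  return rows;
}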
Related
I am new to DynamoDB, and I see that data sorting is done by the sort key value. Is there a way to sort data by any attribute within an item, irrespective of the sort key? I am using Node.js with DynamoDB for my project.
May I know how I can achieve this?
Thank you
So long as you don't need to re-partition your data, the first way to do this is to create an LSI (local secondary index) with your new sort key.
But unlike SQL you can't do ad hoc queries: when you design your table, you already have to know which queries and transactions you will run. In general you shouldn't need to make an LSI every time you want to sort; I recommend watching this.
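One thing to plan for: an LSI can only be defined at table-creation time, so it has to be part of that up-front design. A minimal sketch with the Node.js SDK (table, key, and index names invented):

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.createTable({
  TableName: 'Orders',
  AttributeDefinitions: [
    { AttributeName: 'customerId', AttributeType: 'S' },
    { AttributeName: 'orderDate', AttributeType: 'S' },
    { AttributeName: 'total', AttributeType: 'N' },
  ],
  KeySchema: [
    { AttributeName: 'customerId', KeyType: 'HASH' },
    { AttributeName: 'orderDate', KeyType: 'RANGE' },
  ],
  LocalSecondaryIndexes: [{
    IndexName: 'ByTotal',
    KeySchema: [
      { AttributeName: 'customerId', KeyType: 'HASH' },
      { AttributeName: 'total', KeyType: 'RANGE' }, // the alternate sort key
    ],
    Projection: { ProjectionType: 'ALL' },
  }],
  BillingMode: 'PAY_PER_REQUEST',
}).promise().then(() => console.log('table created'));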
I have a query I wish to run on Google Datastore that is intended to retrieve data from multiple devices. However, I couldn't find anything in the documentation that would allow me to get data from, e.g., device 1 or device 2 or device 3; it seems only one value per property name can be set. Is this a Datastore limitation? Or am I just missing something that I don't know about?
Based on the NodeJS client library, the query might look something like the filter criteria below:
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();
const query = datastore.createQuery('data')
  .filter('device_id', 1)  // chained filters are ANDed together,
  .filter('device_id', 2)  // so this can never match anything --
  .filter('device_id', 3); // OR cannot be expressed this way
Otherwise, I might have to run separate queries for the various devices, which doesn't seem like a very elegant solution, especially if there are a lot of devices to simultaneously run queries on.
Any suggestions for the Datastore API or alternative approaches are welcome!
Yes, this would be an OR operation which is one of the Restrictions on queries (emphasis mine):
The nature of the index query mechanism imposes certain restrictions on what a query can do. Cloud Datastore queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Cloud Datastore.
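Since OR isn't natively supported, the workaround is essentially the one you anticipated: issue one query per device and merge the results client-side. A rough sketch with the Node.js client (kind and property names follow your example; error handling omitted):

const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

// Run one query per device in parallel and concatenate the entities.
async function dataForDevices(deviceIds) {
  const results = await Promise.all(
    deviceIds.map((id) =>
      datastore.createQuery('data').filter('device_id', id).run()
    )
  );
  // query.run() resolves to [entities, queryInfo]; keep the entities.
  return results.flatMap(([entities]) => entities);
}

dataForDevices([1, 2, 3]).then(console.log);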
We are interested in using DocumentDB as a data store for a number of data sources, and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDFs, but this appears to require an entire collection scan, which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted/updated/deleted.
DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the specific path/property you're using in these queries, with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.
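A sketch of that lower-case shadow property (the nameLower property is invented; you would write it alongside name on every insert/update, and userInput stands for whatever the user has typed):

// Stored document: keep a lower-cased copy of the searchable field.
const doc = {
  id: 'hotel-42',
  name: 'Hilton Garden Inn',
  nameLower: 'hilton garden inn',
};

// Parameterized query against the lower-cased copy:
const querySpec = {
  query: 'SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.nameLower, @prefix)',
  parameters: [{ name: '@prefix', value: userInput.toLowerCase() }],
};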
I faced a similar situation, where a fast lookup was required as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups. When testing this under load, we found that to avoid saturation and throttling we would have to increase the DocumentDB Request Unit (RU) throughput to a point that was not financially viable for us in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval (a role it performs exceptionally well), while a small Elasticsearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming, and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subject of type-ahead queries, index creation, autocomplete analyzers, and query-time 'search as you type' in Elasticsearch is covered here, here, and here.
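As a rough illustration of what those pieces look like (index and field names invented; the exact request shape varies across Elasticsearch versions):

// 1) Mapping: declare a field of type "completion" when creating the index.
const mapping = {
  mappings: {
    properties: {
      name: { type: 'text' },
      suggest: { type: 'completion' },
    },
  },
};

// 2) Query: as the user types, send the prefix to the index's _search endpoint.
const suggestQuery = {
  suggest: {
    'name-suggest': {
      prefix: 'hil', // what the user has typed so far
      completion: { field: 'suggest' },
    },
  },
};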
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure Marketplace to create relatively small instances, and, most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).
I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of a Whoosh search are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document which gets searched, and after I have my result, use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
Thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a conventional DB; that gives you more reliability and flexibility for reindexing. Keep in mind that ES indexes non-relational data: you can have your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching.
Content storage for good full-text search is quite different from standard relational database storage, so the data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, conversely, denormalize it so that parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out and then merge them in the client code with the details from the original database records.
I am new to DynamoDB. I have a table with more than 100k items in it, and the table gets refreshed frequently. On this table, I want to be able to do something similar to what I would do in the relational database world: get the maximum value of an attribute from the table.
DynamoDB is a NoSQL database and is therefore very limited in how you can query data. It is not possible to perform aggregations such as finding the maximum value in a table by directly calling the DynamoDB API. You will have to look to different tools and approaches to solve this problem.
There are a number of possible solutions you can consider:
Perform A Table Scan
With more than 100k items in your table this is likely a very bad idea. A table scan will read through every single item, and application-side logic can then identify the maximum value. This really isn't a workable solution.
Materialized Index in DynamoDB
Depending on your use case you can use DynamoDB streams and a Lambda function to maintain an index in a separate DynamoDB table. If your table is insert-only (no updates and no deletions), you could store the maximum in a separate table, and as new records are inserted you can compare them and perform the necessary updates.
This approach is workable under some constrained circumstances, but is not a generalized solution.
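A sketch of that insert-only case (the 'score' attribute, table, and key names are invented):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;
    const value = Number(record.dynamodb.NewImage.score.N); // illustrative attribute
    // Conditional write: only overwrite the stored max if the new value is larger.
    await ddb.update({
      TableName: 'Aggregates',
      Key: { id: 'max-score' },
      UpdateExpression: 'SET #v = :v',
      ConditionExpression: 'attribute_not_exists(#v) OR #v < :v',
      ExpressionAttributeNames: { '#v': 'value' },
      ExpressionAttributeValues: { ':v': value },
    }).promise().catch((err) => {
      // A failed condition just means the stored max is already larger.
      if (err.code !== 'ConditionalCheckFailedException') throw err;
    });
  }
};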
Perform Analytic using Amazon Redshift
DynamoDB is not meant to do analytical operations such as maximum, while Redshift is a very powerful big data platform that can perform these types of calculations with ease. Similar to the DynamoDB index, you can use DynamoDB streams to send data into Redshift as records get inserted to maintain a near real time copy of the table for analytical purposes.
If you are looking for more of an offline or analytical solution this is a good choice.
Perform Analytics using Elasticsearch
While DynamoDB is a powerful NoSQL solution with strong guarantees on data durability, Elasticsearch provides a very flexible querying method that supports aggregations such as maximum, and these aggregations can be sliced and diced on any attribute value in real time. Similar to the above solutions, you can use DynamoDB streams to send record inserts, updates, and deletions into the Elasticsearch index in real time.
If you want to stick with DynamoDB but need some additional querying capability, this is really a good option especially when using the AWS ES service which will fully manage an Elasticsearch cluster for you. It is important to remember that Elasticsearch doesn't replace your DynamoDB table, it is just an easily searchable index of the same data.
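Once the data is in Elasticsearch, the maximum is a one-line aggregation. A sketch of the request body (index and field names invented):

// POST this to /<index>/_search; the answer comes back under
// aggregations.max_score.value
const aggQuery = {
  size: 0, // skip the matching documents, return only the aggregation
  aggs: {
    max_score: { max: { field: 'score' } },
  },
};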
Just use a SQL Database
The obvious solution is if you have SQL requirements then move from a NoSQL based system to a SQL based system. AWS's RDS offering provides a managed solution. While DynamoDB provides a lot of benefits, if your use case is pulling you towards a SQL solution the easiest thing to do may be to not fight it and just change solutions.
This is not to say that a SQL-based or a NoSQL-based solution is better; there are pros and cons to each, and those vary based on the specific use case. But it is definitely an option to consider.
DynamoDB doesn't expose a MAX aggregate through its own API, but you can run one against a DynamoDB table with Hive on Amazon EMR: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.Querying.html