NoSQL - how to implement autosuggest and best matches properly? - search

We're building a database of cars and their properties, supposed to be stored in a DynamoDB.
Creating a cars table and filling it with objects that has properties like brand, model, year etc is easy.
But we also want a few other features en the admin interface:
Suggestions when typing
When creating a car, it should suggest brand and model from existing cars, when typing in the field.
Should we then maintain a list of brands and models in another table, and make a query to that table, when the user types?
Or is it good enough to query the "rich" table of car definitions, and get all values for brand, all model values where brand has a certain value, etc? My first thought is that it would be a heavy operation and we'd want a separate index of cars and models. But I'm not a NoSQL expert...
Best matches
When enrolling a new car in our system we want to use use an existing defined car as a reference if possible.
So when the user has typed in a brand, model, year etc we want to show a few options of the best matches - we can accept that they year etc. is different, but want the best matches first.
What is the best way to do matches like this on data in a NoSQL database? Any links to tools, concepts etc. will be appreciated :)
Thanks in advance

In dynamodb (all nosql), the less you create tables the best is your architecture (this is one of the main reason we use nosql), so no need of a new table, just add a new attribute and fill it with the searchable data you want, just have in mind that querying by dynamodb is case sensitive and you only can use the begins_with or the contains function to query data
The cons are :
You will use lot of reading capacity unit
You have to manage the capital letters
You have to fabric at each creation the searchable attribute
The solution I suggest is using aws cloudsearch, which gives an out of the boxes suggester, you will will have better results and give a better user experience, the indexation in cloudsearch is automatic each time you have a new item, but be aware of the pricing, however they will give you 30 day for free

Related

Create Mongoose Schema Dynamically for e-commerce website in Node

I would like to ask a question about a possible solution for an e-commerce database design in terms of scalability and flexibility.
We are going to use MongoDB and Node on the backend.
I included an image for you to see what we have so far. We currently have a Products table that can be used to add a product into the system. The interesting part is that we would like to be able to add different types of products to the system with varying attributes.
For example, in the admin management page, we could select a Clothes item where we should fill out a form with fields such as Height, Length, Size ... etc. The question is how could we model this way of structure in the database design?
What we were thinking of was creating tables such as ClothesProduct and many more and respectively connect the Products table to one of these. But we could have 100 different tables for the varying product types. We would like to add a product type dynamically from the admin management. Is this possible in Mongoose? Because creating all possible fields in the Products table is not efficient and it would hit us hard for the long-term.
Database design snippet
Maybe we should just create separate tables for each unique product type and from the front-end, we would select one of them to display the correct form?
Could you please share your thoughts?
Thank you!
We've got a mongoose backend that I've been working on since its inception about 3 years ago. Here some of my lessons:
Mongodb is noSQL: By linking all these objects by ID, it becomes very painful to find all products of "Shop A": You would have to make many queries before getting the list of products for a particular shop (shop -> brand category -> subCategory -> product). Consider nesting certain objects in other objects (e.g. subcategories inside categories, as they are semantically the same). This will save immense loading times.
Dynamically created product fields: We built a (now) big module that allows user to create their own databse keys & values, and assign them to different objects. In essence, it looks something like this:
SpecialFieldModel: new Schema({
...,
key: String,
value: String,
...,
})
this way, you users can "make their own products"
Number of products: Mongodb queries can handle huge dataloads, so I wouldn't worry too much about some tables beings thousands of objects large. However, if you want large reports on all the data, you will need to make sure your IDs are in the right place. Then you can use the Aggregation framework to construct big queries that might have to tie together multiple collectons in the db, and fetch the data in an efficient manner.
Don't reference IDs in both directions, unless you don't know what you're doing: Saving a reference to category ID in subcatgories and vice-versa is incredibly confusing. Which field do you have to update if you want to switch subcategories? One or the other? Or both? Even with strong tests, it can be very confusing for new developers to understand "which direction the queries are running in" (if you are building a proudct that might have to be extended in the future). We've done both which has led to a few problems. However, those modules that saved references to upper objects (rather than lower ones), I found to be consistently more pleasant and simple to work with.
created/updatedAt: Consider adding these fields to every single model & Schema. This will help with debugging, extensibility, and general features that you will be able to build in the future, which might otherwise be impossible. (ProductSchema.set('timestamps', true);)
Take my advice with a grain of salt, as I haven't designed most of our modules. But these are the sorts of things I consider as continue working on our applications.

Designing indices to have paging with filters and random page jump Elasticsearch

I just want to have an expert opinion about my use case and the way I am planning to use indices to see if there is no problem in my approach or if there are any better ways to achieve it. Since I am new to ES, your opinions would really help me. We are storing data in couchdb in different databases based for each type of data.
I have database that serves as a link between 2 databases. For example, database A has 'floor' data, database B that links floor to items and then separate database for each item that a floor can have (e.g., card reader, camera etc).
We need to search for items that are linked to a floor and get them with filtering and paging. (Right now my links database has only ids and type but I am also planning to save name for each type as well in links db so that I can have filtering while I can do paging).
The way I want to achieve filtering and paging in my datastore is, I'll just have indices for each db. So based on floor, i'll get all its linked items for a type and 'search filter' (from index of links db) that would give me a page of certain items, i'll then use ids from that result to get those full objects (from index of) db of that item type.
Please let me know if there is any better approach in handling that, like e.g., if I can create one index for my floor and links and item databases and is it possible to do that through logstash couchdb plugin.
Many thanks.
Your setup does not sound wrong, but there are alternatives. You can use nested objects or parent-child relationships for an easier setup. Both approaches have their advantages. It all depends on the type of queries that you would like to do, and the amount of items that are related.
I would start by reading he next section of the definitive guide, that should give you a good start.
https://www.elastic.co/guide/en/elasticsearch/guide/current/modeling-your-data.html?q=model

PouchDB structure

i am new with nosql concept, so when i start to learn PouchDB, i found this conversion chart. My confusion is, how PouchDB handle if lets say i have multiple table, does it mean that i need to create multiple databases? Because from my understanding in pouchdb a database can store a lot of documents, but a document mean a row in sql or am i misunderstood?
The answer to this question seems to be surprisingly under-documented. While #llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines column header (name and type) - that are the JSON property names of the doc.
So, all docs (rows) with the same properties (a so called "schema") are the equivalent of your SQL table. You can have as much different schemata in one database as you want (visit json-schema.org for some inspiration).
How to request them separately? Create CouchDB views! You can get all/some "rows" of your tabular data (docs with the same schema) with one request as you know it from SQL.
To write such views easily the property type is very common for CouchDB docs. Your known name from a SQL table can be your type like doc.type: "animal"
Your view names will be maybe animalByName or animalByWeight. Depends on your needs.
Sometimes multiple-databases plan is a good option, like a database per user or even a database per user-feature. Take a look at this conversation on CouchDB mailing list.

showing search results more efficiently?

I want to implement the auto-complete feature provided by various e-commerce stores. Functionality is pretty simple, when you type some characters, it start showing relevant suggestions.
I implemented it using solr (django-haystack), using the autocomplete method provided in haystack.query.SearchQuerySet . Basically, i get a list of results sorted by the score. Showing top n results as suggestions.
Solr document contains $product_name, $category_name and other fields. So the results which i generated looks like list of " in ".
Problem arise when i change the category name. If i change the category name, i have to update all the product belong to that particular category to reflect these changes in the auto-complete (update all documents in solr for products of this category).
Another way to do this is by putting just the id of the categories with product in the solr document. In that case, I have do look-up for category name each time, and this is not efficient.
Is there any other efficient way to do this?
Since you are changing the underlying data, the same has to be propogated to SOLR.
There are different approaches to do this:
Update the database, and reindex - Pros: Simple enough, Cons: Indexing time can be large.
Update database and Solr in tandem - Pros: Quick updates, almost instantaneous, Cons: Can lead to data inconsistency (if one update fails)
Update database, and schedule a delta-import in Solr. This is like a middle ground between the two above.
I would recommend the 3rd approach, but this would require some upfront schema design. Read more about delta import here, in context of DataImportHandler.

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerfull index, so as an index gives IDs back, it would be logical that solr does the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it should matter how you build your objects in your search function, either from the DB using uniquely solr to retrieve IDs or from solr returned fields (providing they're stored) or even a mix of both. Think solr to get the 'highlighted' content fields and DB for the other ones. Again if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the ids for the following reasons :
For Solr :
- if some sync mistake append, it's not a big deal (especially in your case, displaying a different price can be a big issue... it's like the item will not be in the right place, but the data are right)
- you will save a lot of time because when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB :
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr everytime !!!)
- you build you results in the same way (you don't need a specific method when you want to build html from Solr, and an other method from your DB)
I think there is a lot more...

Resources