Data relationships as a context for search in Marklogic - search

I using marklogic's search functionality to create a search page. As of right now, I'm running an XQuery to get search results through search:search. As a bare bones example, see this code:
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:search('test',
<options xmlns='http://marklogic.com/appservices/search'></options>)
This search searches all content in the database, which is fine in many cases. In other cases, I search based on collections with cts:collection-query. The collections serve as great contexts for my searches.
Now, I would like to limit my search results based on a relationship of data in a "main" document. This "main" document has all the relationships in an object model. If that object model has a reference to a document, I want that document included in the search. Essentially, the "main"/model document is the context of the search.
I was trying to brainstorm some ideas of the best way to to this. Here's what I've come up with thus far, but I was hoping someone more familiar with Marklogic (I've only been working with it for 6 months) could lead me in a good direction:
Add all documents referenced in the model document to a unique collection. Then query search based on that collection. However, the collections would have to be updated as the model changed.
Load the model document into my code and get a list of all the references and add them to a query by cts:document-query (or the like).
Restructure my concept of a "model" somehow in my XML documents.
Thanks for any input or suggestions.

I would start with (2) and see if the performance is good enough. That will depend on your use-case, but I expect it should be fine for thousands or even hundreds of thousands of references.
Be sure to use a single-term cts:document-query($list-of-references). That will be faster than cts:or-query(for $ref in $list-of-references return cts:document-query($ref)), because the index lookup can be a single pass instead of N separate lookups.

All of these ideas would work fine. Deciding which to use depends on particulars of your application such as how often the main document is changed (and are you in control of it),
how hard it is to remodel your XML.
Another thing to consider is you can set a trigger on document updates which could perform the collection changes automatically.
-David Lee

Related

MongoDB - When to add SubDocuments and when to Ref

Im using MongoDB for storing information for a nodeJS application and a doubt came to my mind, after finding that it is possible to use ObjectID to ref another document. As it is known, MongoDB is a no-SQL db, so there is no need for consistency whatsoever and information can be repeated.
So, lets say, I have a collection for users and one of their field values is 'friends', which is an array of this user friends (another users). What is the best practice, saving all the user info there (thus repeating the same thing over and over again throughout the DB) or saving only the ObjectID of the friendUser (makes way more sense to me, but it sounds kinda SQL mindset). I'm not really getting when should I use each of the options, so a professional opinion would be very appreciated.
To model relationships between connected data, you can reference a document or embed it in another document as a subdocument.
Referencing a document does not create a “real” relationship between these two documents as does with a relational database.
Referencing documents is also known as normalization. It is good for data consistency but creates more queries in your system.
Embedding documents is also known as denormalization.
The benefit of Embedding approach is getting all the data you need about a document and it’s sub-document(s) with a single query. Therefore, this approach is very fast. The drawback is that data may not stay as consistent in the database.
Important
If one document is to be used by many documents then better create a referenced doc.
i. Will Save Space.
ii. if any change required, we will have to update only the referenced doc
instead of updating many docs.
Create sub doc(embedded)
i. If another document is not dependent on the subdocument.
Source: https://vegibit.com/mongoose-relationships-tutorial/
Recommended reading:
MongoDB Applied Design Patterns by Rick Copeland
To Embed or Reference

Siren "more like this" query

i am using the newest Siren distribution for Solr to index my data and search it. (http://siren.solutions/siren/downloads/)
Is there a simple way to search similar documents in my indexed data. Something similar to the MoreLikeThis query of Solr (https://cwiki.apache.org/confluence/display/solr/MoreLikeThis).
My goal is to find documents that have a similar json structure that the one i am interested in.
best,
Bernd
If I remember SIREn stores the RDF representation of each resource within a dedicated field of the Solr document. I don't think the default MLT component that comes with Solr works for your scenario.
I mean, enabling that component will produce some kind of result but I don't believe that it will follow your json "similarity" requirement.
On top of that I suggest you to post your request on SIREn mailing list [1]: I'm sure the dev team will address you on the right path.
[1] https://groups.google.com/forum/m/#!forum/siren-user

showing search results more efficiently?

I want to implement the auto-complete feature provided by various e-commerce stores. Functionality is pretty simple, when you type some characters, it start showing relevant suggestions.
I implemented it using solr (django-haystack), using the autocomplete method provided in haystack.query.SearchQuerySet . Basically, i get a list of results sorted by the score. Showing top n results as suggestions.
Solr document contains $product_name, $category_name and other fields. So the results which i generated looks like list of " in ".
Problem arise when i change the category name. If i change the category name, i have to update all the product belong to that particular category to reflect these changes in the auto-complete (update all documents in solr for products of this category).
Another way to do this is by putting just the id of the categories with product in the solr document. In that case, I have do look-up for category name each time, and this is not efficient.
Is there any other efficient way to do this?
Since you are changing the underlying data, the same has to be propogated to SOLR.
There are different approaches to do this:
Update the database, and reindex - Pros: Simple enough, Cons: Indexing time can be large.
Update database and Solr in tandem - Pros: Quick updates, almost instantaneous, Cons: Can lead to data inconsistency (if one update fails)
Update database, and schedule a delta-import in Solr. This is like a middle ground between the two above.
I would recommend the 3rd approach, but this would require some upfront schema design. Read more about delta import here, in context of DataImportHandler.

Tagging and Analysing a Search Query

I'm developing a search engine which functions taking the semantics of data into account, unlike the usual keyword based index. I managed to develop a reasonable index for the search using metadata extraction methods and RDF, but I have difficulty in using such methods on the search query itself since the search query is very much shorter that the actual data. any idea how to perform a successful tagging of a search query, using similar methods, natural language processing, etc. ?
Thank You!
Yes, the sample size of a typical query is too small for semantic analysis to be of any value.
One approach might be to constrain or expand your query using drop-down menus for things like "Named Entities" or "Subject Verb Object" tuples.
Another approach would be to expand simple keywords using rules created from your metadata so that, for example, a query for 'car' might be expanded to the tuple pattern
(*,[drive,operate,sell],[car,automobile,vehicle])
before submission.
Finally, you might try expanding the query with a non-semantically valuable prefix and/or suffix to get the query size large enough to trigger OpenCalais' recognizer.
Something like 'The user has specified the following terms in her query: one, two, three.'.
And once the results are returned, filter out all results that match only the added prefix/suffix.
Just a few quick thoughts.
You need to build semantic tree. It will based on the combination of keywords.
For example, automobile -->vehicle --> car this relation technical aspect of car. travel --
hire/rent-->vehicle-->car this is something related to travel and rent a car.
In this case MongoDB will help you a lot.

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerfull index, so as an index gives IDs back, it would be logical that solr does the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it should matter how you build your objects in your search function, either from the DB using uniquely solr to retrieve IDs or from solr returned fields (providing they're stored) or even a mix of both. Think solr to get the 'highlighted' content fields and DB for the other ones. Again if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the ids for the following reasons :
For Solr :
- if some sync mistake append, it's not a big deal (especially in your case, displaying a different price can be a big issue... it's like the item will not be in the right place, but the data are right)
- you will save a lot of time because when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB :
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr everytime !!!)
- you build you results in the same way (you don't need a specific method when you want to build html from Solr, and an other method from your DB)
I think there is a lot more...

Resources