Tagging schema for CouchDB. Would this work?

Would the following scenario be a good fit for CouchDB? I am building a web-based flashcard application. Users can create flashcards (question on one side, answer on the other). Flashcard authors and other users can tag flashcards with keywords or phrases. Users can retrieve or generate virtual stacks of cards based on tags, including support for boolean search (tagA AND tagB NOT tagC OR tagD). The DB will store cards (obviously) but also "documents" for users, tags and potentially virtual stacks of cards.
I have read other SO questions concerning tagging within CouchDB but am wondering whether the following would work or be too write-intensive:
(1) Card documents contain a JSON array of the tag strings assigned to that card.
(2) Tag documents contain a JSON array of the cards using the tag.
(3) Tag documents also have an element to store the count of cards using that tag.
(4) Whenever a new card is created or a tag is added to a card, the associated card identifier is also added to the tag document and the tag document's CardCount element is incremented.
(5) Permanent views of cards indexed by card ID and of tags indexed by tag string are generated.
If I know the card ID I can find the document quickly and can quickly get a list of associated tags. If I am given a tag string I can quickly find the tag document and then get a list of card IDs using the tag. For more elaborate boolean search I can retrieve a list of card IDs for each tag in the boolean search and then compute the union, intersection, etc. of these sets on the client. Does this seem reasonable? I am aware of the full-text indexing option using Lucene but would like to avoid this if possible. Thanks.
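The client-side set arithmetic described above could be sketched in Python as follows. The tag-to-card mapping is made-up sample data standing in for the card ID lists read from the tag documents, and the helper name cards_for is hypothetical.

```python
# Made-up sample data: tag string -> set of card IDs using that tag.
# In the scheme described above this would come from the tag documents.
cards_by_tag = {
    "tagA": {"c1", "c2", "c3"},
    "tagB": {"c2", "c3", "c4"},
    "tagC": {"c3"},
    "tagD": {"c9"},
}


def cards_for(tag):
    """Return the set of card IDs for a tag (empty set if the tag is unknown)."""
    return set(cards_by_tag.get(tag, ()))


# (tagA AND tagB NOT tagC) OR tagD, expressed with set operators:
#   AND -> intersection (&), NOT -> difference (-), OR -> union (|)
result = ((cards_for("tagA") & cards_for("tagB")) - cards_for("tagC")) | cards_for("tagD")
print(sorted(result))
```

The grouping of the boolean expression is an assumption here; a real search would need an explicit precedence rule or parentheses.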

I don't see a good reason to make things that complicated - just create a document per card and add tags to these documents as you go along. Create a few stored views to count/show tags and their usage number when you need it.
This way you'll only need:
card document:
    question
    answer
    tags[]
views:
    show card info
    show (perhaps most popular) tags
    show cards per tag
As long as your documents are properly structured, you won't need full text search to handle everything.
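In CouchDB the views themselves would be written as JavaScript map functions, typically paired with a counting reduce. The Python below is only a rough imitation of what such views would compute, using two made-up card documents; the function and variable names are illustrative.

```python
from collections import defaultdict

# Made-up card documents following the structure listed above.
cards = [
    {"_id": "card1", "question": "2 + 2?", "answer": "4",
     "tags": ["math", "easy"]},
    {"_id": "card2", "question": "Capital of France?", "answer": "Paris",
     "tags": ["geography", "easy"]},
]


def map_tags(doc):
    """Imitates a view map function: yields one (tag, card id) row per tag."""
    for tag in doc.get("tags", []):
        yield tag, doc["_id"]


# "show cards per tag": group the emitted rows by key.
cards_per_tag = defaultdict(list)
for doc in cards:
    for tag, card_id in map_tags(doc):
        cards_per_tag[tag].append(card_id)

# "show (perhaps most popular) tags": count rows per key, highest count first.
tag_counts = {tag: len(ids) for tag, ids in cards_per_tag.items()}
most_popular = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(most_popular)
```

Because the tags live only on the card documents, adding a tag is a single document update and the counts are derived by the view rather than maintained by hand.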

Related

GSA: index e-commerce sites and display the results sorted by price?

We want to index a few public e-commerce sites. When our customers search for a product, the results should be displayed sorted by price across all indexed e-commerce sites.
From my understanding, the public e-commerce sites use different meta tags for pricing, so I cannot consolidate them into one meta tag.
Is it possible to feed via XML? I don't have much idea how to achieve this, and we don't have DB access to parse only the required data.
Using entity recognition, how can I index the price as a meta tag?
Could you please advise us whether this is achievable? If yes, which is the best solution, and can you refer us to documentation for it?
Ignoring the sorting issue and concentrating on the problem of normalising the price metadata: you need a way to read the price from whatever metadata field it's in and create a new metadata field with a common name and the same value.
There are a few ways to do this but the simplest are probably:
Generate a metadata-and-url feed for each document and add in the normalised metadata
Crawl via a proxy that can add an X-GSA-External-Metadata header containing the normalised metadata
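Whichever route is used, the normalisation step itself is small. A minimal Python sketch, assuming the metadata has already been parsed into a dictionary; the source field names and the target name normalised_price are illustrative guesses, not taken from any particular site or from the GSA documentation.

```python
# Illustrative guesses at the field names different sites might use.
PRICE_FIELDS = ("price", "product:price:amount", "og:price:amount")


def normalise_price(metadata, target="normalised_price"):
    """Return a copy of metadata with the first price found copied to target.

    If none of the known price fields is present, the copy is returned
    unchanged so the caller can decide how to handle the document.
    """
    out = dict(metadata)
    for name in PRICE_FIELDS:
        value = metadata.get(name)
        if value:
            out[target] = value
            break
    return out


print(normalise_price({"title": "Kettle", "og:price:amount": "19.99"}))
```

The resulting common field is what would then be written into the feed record or the external metadata header.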

Retrieving term ids for taxonomy field

I'm running into slow performance related to select N+1 issues when accessing the taxonomy terms associated with content items of a custom content type.
I've worked around issues like this in the past by getting all the related content ids up front so that I can use the ContentManager's GetMany method to get them all in one shot. For example, this has worked well for the MediaLibraryPickerField as I can easily get at the media content ids using the Ids property on the field. Here's an example similar to what I've done: Eager loading a field
I'd like to use a similar approach for getting taxonomy terms but I can't figure out how to get the term ids for a Taxonomy field on my content item. It seems that I can only access the lazy loaded term part which results in a select N+1 as they are retrieved for each content item.
Is there a way to get just the term ids for a taxonomy field without retrieving the whole term? I've spent some time digging around in the Taxonomy module source but I'm not finding any way to do it.
Any suggestions?
Try to inject IRepository<TermContentItem>. You should then be able to query over that, going beyond what the service offers natively.

CouchDB and Couchbase Document Keys

In reference material for CouchDB and Couchbase it's common guidance to store the type of a document as a parameter within the actual document.
I've got a database, where I have different documents that record certain behaviour by URL. So naturally, I use the URL as the id of the document.
The problem I find is that by using just the URL as the document id, I now get clashes between documents of different types. So I have started using the type as the first part of the key like this:
{ "_id": "rss_entry|http://www.spiegel.de/1234", [...] }
{ "_id": "page_text|http://www.spiegel.de/1234", [...] }
Now I am starting to wonder why I've never seen this approach to modelling type in any of the documentation.
Prefixes are commonly used. Besides supporting scenarios such as yours, prefixing lets you perform logical range queries against views. The technique appears in the modeling examples, though perhaps the concept is not described in as much detail as you were expecting. In the section http://docs.couchbase.com/couchbase-devguide-2.5/#modeling-documents, the documents are keyed as beer_NNNN and brewery_NNNN. The section http://docs.couchbase.com/couchbase-devguide-2.5/#using-reference-documents-for-lookups goes a bit deeper into the technique: there is a counter document named user::count, and each user is keyed as user::NNNN. There are also documents in the example keyed as fb::NNNN for a Facebook ID, email::XXX@YYYY.com for a user's email address, etc.
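To illustrate why prefixed keys make range queries possible, here is a small Python sketch over a sorted list of keys. It only imitates the idea of a start key and end key on a sorted index; plain Python string ordering stands in for the database's own collation, and the high sentinel character "\ufff0" is a common convention assumed here rather than something the answer specifies.

```python
from bisect import bisect_left, bisect_right

# Made-up document keys, kept sorted as an index would keep them.
keys = sorted([
    "page_text|http://www.spiegel.de/1234",
    "rss_entry|http://www.spiegel.de/1234",
    "rss_entry|http://www.spiegel.de/5678",
    "user::0001",
])


def keys_with_prefix(sorted_keys, prefix):
    """Return all keys in [prefix, prefix + high sentinel] from a sorted list."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + "\ufff0")
    return sorted_keys[lo:hi]


print(keys_with_prefix(keys, "rss_entry|"))
```

Because all keys of one type share a prefix, they sit next to each other in the sorted order and can be fetched as one contiguous slice.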

RescorerProvider filter element by tag

Is it possible to create a RescorerProvider to filter out elements which are associated with a specific tag? Or should I implement my own model with the relevant data, as in the book "Mahout in Action" on page 79?
Route: /recommend/?rescorerParams=sports
Push score of elements which are associated with tag sports
For this you would have to separately track which items are associated with which tags. This information is not queryable within the server itself. But you can periodically cache this info from an external source and then apply whatever logic you like based on tags. Yes, it is kind of like what's in Mahout in Action.

How to find related items by tags in Lucene.NET

My indexed documents have a field containing a pipe-delimited set of ids:
a845497737704e8ab439dd410e7f1328|
0a2d7192f75148cca89b6df58fcf2e54|
204fce58c936434598f7bd7eccf11771
(ignore line breaks)
This field represents a list of tags. The list may contain 0 to n tag Ids.
When users of my site view a particular document, I want to display a list of related documents.
This list of related document must be determined by tags:
Only documents with at least one matching tag should appear in the "related documents" list.
Documents with the most matching tags should appear at the top of the "related documents" list.
I was thinking of using a WildcardQuery for this but queries starting with '*' are not allowed.
Any suggestions?
Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.
Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.
In some sort of pseudo code mockup, this would then become
for each ID in tags
    get all documents referencing this tag
    for each document found
        if master-list contains document; increase usage-count
        else; add document to master list
sort master-list by usage-count descending
Forgive me for not being more precise, but I am unable to mock up a fully working example right now.
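For readers who want something closer to running code, the pseudocode above might be sketched in Python like this. The dictionary referrers_by_tag is made-up data standing in for whatever the LinkDatabase lookup returns; no Sitecore API is shown or assumed, and the function name is hypothetical.

```python
from collections import Counter

# Made-up data: tag ID -> documents referencing that tag.
# In Sitecore this lookup would come from the LinkDatabase.
referrers_by_tag = {
    "tag1": ["docA", "docB"],
    "tag2": ["docB", "docC"],
    "tag3": ["docB", "docA"],
}


def related_documents(tag_ids, current_doc=None):
    """Count how many of the given tags each document shares, most first."""
    usage = Counter()
    for tag_id in tag_ids:
        for doc in referrers_by_tag.get(tag_id, []):
            if doc != current_doc:  # skip the document being viewed
                usage[doc] += 1
    # Sort by usage count descending, then by name for a stable order.
    return sorted(usage.items(), key=lambda kv: (-kv[1], kv[0]))


print(related_documents(["tag1", "tag2", "tag3"]))
```

Skipping current_doc is an addition to the pseudocode, since the document being viewed always shares all of its own tags.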
You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html
Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.
You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.
Try this query on the tag field.
+(tag1 OR tag2 OR ... tagN)
where tag1, .. tagN are the tags of a document.
This query will return documents with at least one matching tag. Scoring automatically brings the documents with the most matches to the top, since the final score is the sum of the individual scores.
Also, you need to realize that if you search for documents similar to the tags of Doc1, Doc1 itself will come out at the top of the search results, so handle this case accordingly.
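As a rough illustration of the two steps above (splitting the pipe-delimited value and building the OR query), here is a Python sketch that only produces the query string. It does not call Lucene.NET; the field name "tag" and the helper names are assumptions, and no escaping is done because the ids in the question are plain hex strings.

```python
def split_tags(raw):
    """Split a pipe-delimited tag field into individual tag ids."""
    return [t.strip() for t in raw.split("|") if t.strip()]


def related_query(raw_tags, field="tag"):
    """Build a query string of the form +(tag:a OR tag:b ...), or None if no tags."""
    tags = split_tags(raw_tags)
    if not tags:
        return None
    return "+(" + " OR ".join(f"{field}:{t}" for t in tags) + ")"


def drop_self(hit_ids, doc_id):
    """Remove the document being viewed from an ordered list of result ids."""
    return [h for h in hit_ids if h != doc_id]


raw = "a845497737704e8ab439dd410e7f1328|0a2d7192f75148cca89b6df58fcf2e54"
print(related_query(raw))
```

The same split would be applied at index time so that each id is stored as its own "tag" field value.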
