how to set up the searchable fields in Lucene - search

I have incoming queries and I want to search only in certain fields (author, book title), not in the book content field. How can I achieve this in Lucene?
Another question: how can I give a higher rank to documents that have matches in the author field? For example, if doc1 has a match in "book content" and doc2 has a match in "author", how can I rank doc2 higher?

You can combine multiple queries using BooleanQuery, adding each one with Occur.SHOULD (meaning OR). You can also boost specific queries in such a scenario, so that matches in a specific field have higher relevance than matches in, for instance, content.
Example (C#):
var query = new BooleanQuery();
// TermQuery takes a Term (field name plus text), not two strings.
var authorQuery = new TermQuery(new Term("author", searchTerm));
authorQuery.Boost = 2.0f; // illustrative value: ranks author matches higher
query.Add(authorQuery, Occur.SHOULD);
query.Add(new TermQuery(new Term("book title", searchTerm)), Occur.SHOULD);

Related

Lucene (6.2.1) deleteDocuments based on StoredField

I am using a few StoredFields and a few TextFields in my index (Lucene 6.2.1).
Every document has its own unique ID.
If I create the field as
Field docID = new TextField("docID", docId, Field.Store.YES);
I am able to delete the document as follows:
Field transactionIdField = new TextField("transactionId", transactionId, Field.Store.YES);
Term docIdTerm = new Term("docID", docId);
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
IndexWriter writer = repositoryWriters.getTargetIndexWriter(repositoryUuid);
// 4. remove document with docId
writer.deleteDocuments(docIdTerm);
LOG.log(Level.INFO, "Document removed from Index, docID: {0}", docId);
writer.commit();
But if I create the field as
Field docID = new StoredField("docID", docId);
the document is not deleted.
How can I delete a document based on a StoredField value?
I want to keep it a StoredField so that users cannot search for the document based on docID.
Quoting StoredField documentation,
A field whose value is stored so that IndexSearcher.doc and
IndexReader.document() will return the field and its value.
That is, it is simply a stored field of the document, and no terms are indexed for it.
The method IndexWriter.deleteDocuments(Term... terms) will not find that document, since there is no Term for a StoredField.
A TextField, on the other hand, is indexed and has terms generated for it:
A field that is indexed and tokenized, without term vectors. For
example this would be used on a 'body' field, that contains the bulk
of a document's text.
A stored TextField is indexed as well as stored, so its terms are available and its value is stored to reconstruct the document too.
So, in summary, you can't delete a document on the basis of a StoredField alone; you also need an indexed field with the same name to be able to delete it.
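The distinction can be illustrated with a toy model in JavaScript. This is not Lucene code: ToyIndex, addDocument and deleteByTerm are hypothetical names, and the sketch assumes untokenized field values for simplicity. It only shows why a delete-by-term cannot find a value that was stored but never indexed.

```javascript
// Toy model only: not the Lucene API. All names here are hypothetical.
class ToyIndex {
  constructor() {
    this.docs = new Map();     // docNumber -> stored field values
    this.postings = new Map(); // "field:value" -> Set of docNumbers
    this.nextDoc = 0;
  }

  // Each field is { name, value, stored, indexed }.
  addDocument(fields) {
    const docNumber = this.nextDoc++;
    const stored = {};
    for (const f of fields) {
      if (f.stored) stored[f.name] = f.value;
      if (f.indexed) {
        const key = `${f.name}:${f.value}`;
        if (!this.postings.has(key)) this.postings.set(key, new Set());
        this.postings.get(key).add(docNumber);
      }
    }
    this.docs.set(docNumber, stored);
    return docNumber;
  }

  // Deletion looks the term up in the postings; stored values are never consulted.
  deleteByTerm(name, value) {
    const key = `${name}:${value}`;
    const hits = this.postings.get(key) || new Set();
    for (const docNumber of hits) this.docs.delete(docNumber);
    this.postings.delete(key);
    return hits.size;
  }
}

const index = new ToyIndex();
index.addDocument([{ name: "docID", value: "A1", stored: true, indexed: false }]);
index.addDocument([{ name: "docID", value: "B2", stored: true, indexed: true }]);

const deletedStoredOnly = index.deleteByTerm("docID", "A1"); // no term exists for A1
const deletedIndexed = index.deleteByTerm("docID", "B2");    // a term exists for B2
console.log(deletedStoredOnly, deletedIndexed);
```

In this model the stored-only document survives the delete call, which mirrors the behaviour described in the question.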

loopback relational database hasManyThrough pivot table

I seem to be stuck on a classic ORM issue and don't know really how to handle it, so at this point any help is welcome.
Is there a way to get the pivot table on a hasManyThrough query? Better yet, apply some filter or sort to it. A typical example
Table products
id,title
Table categories
id,title
table products_categories
productsId, categoriesId, orderBy, main
So, in the above scenario, say you want to get all categories of product X that have main = true, or you want to sort the product categories by orderBy.
What happens now is a first SELECT on products to get the product data, a second SELECT on products_categories to get the categoriesId values, and a final SELECT on categories to get the actual categories. Ideally, filters and sorting should be applied to the second SELECT, like
SELECT `id`,`productsId`,`categoriesId`,`orderBy`,`main` FROM `products_categories` WHERE `productsId` IN (180) AND `main` = 1 ORDER BY `orderBy` DESC
Another typical example would be ordering the product images in the order the user wants.
So you would have a products_images table
id,image,productsId,orderBy
and you would want to
SELECT * FROM products_images WHERE productsId IN (180) ORDER BY orderBy ASC
Is that even possible?
EDIT : Here is the relationship needed for an intermediate table to get what I need based on my schema.
Products.hasMany(Images,
{
as: "Images",
"foreignKey": "productsId",
"through": ProductsImagesItems,
scope: function (inst, filter) {
return {active: 1};
}
});
The thing is, the scope function gives me access to the final result and not to the intermediate table.
I am not sure I fully understand your problem(s), but you certainly need to move away from the table concept and express your problem in terms of Models and Relations.
The way I see it, you have two models: Product (properties: title) and Category (properties: main).
Then, you can have relations between the two, potentially
Product belongsTo Category
Category hasMany Product
This means a product will belong to a single category, while a category may contain many products. There are other relations available.
Then, using the generated REST API, you can filter GET requests to get items according to their properties (like main in your case), or use custom GET requests (automatically generated when you add relations) to get, for instance, all products belonging to a specific category.
Does this help?
Based on what you have here I'd probably recommend using the scope option when defining the relationship. The LoopBack docs show a very similar example of the "product - category" scenario:
Product.hasMany(Category, {
as: 'categories',
scope: function(instance, filter) {
return { type: instance.type };
}
});
In the example above, instance is a category that is being matched, and each product would have a new categories property that would contain the matching Category entities for that Product. Note that this does not follow your exact data schema, so you may need to play around with it. Also, I think your API query would have to specify that you want the related categories data loaded (it is not included by default):
/api/Products/13?filter={"include":["categories"]}
I suggest you define a custom remote method in Product.js that does the work for you.
Product.getCategories = function (_productId) {
// If you take the product title as a parameter instead of _productId,
// you first need to look up the product ID.
// Then execute a find query on products_categories with
// 1. a where filter to get only main categories with productId = _productId
// 2. an include filter to include the product and category objects
// 3. an order filter to sort items by the orderBy column
// You will get an array of products_categories.
// Each item in the array will have nested Product and Category objects.
};
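The query described in those comments might be expressed as a LoopBack filter object along the following lines. This is only a sketch: buildMainCategoriesFilter is a hypothetical helper, and the relation names "product" and "category" on the through model, as well as the model name ProductsCategories, are assumptions that depend on how your models are defined.

```javascript
// Hypothetical helper: builds the filter for a find() on the through model.
// The relation names "product" and "category" are assumptions.
function buildMainCategoriesFilter(productId) {
  return {
    where: { productsId: productId, main: true }, // only main categories of this product
    include: ["product", "category"],             // nest the related objects
    order: "orderBy DESC",                        // sort by the pivot column
  };
}

const filter = buildMainCategoriesFilter(180);
console.log(JSON.stringify(filter));

// Inside the remote method, the filter would be passed to the through model, e.g.
// ProductsCategories.find(filter, callback); // model name is an assumption
```

Because the filter is applied to the through model itself, the where and order clauses act on the pivot columns rather than on the final Category rows.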

Lucene wild card search

How can I perform a wildcard search in Lucene ?
I have the text: "1997_titanic"
If I search for "1997_titanic", it returns a result, but the following two searches do not work:
1) If I search with only 1997 it is not returning any results.
2) Also if there is a space, such as in "spider man", that is not finding any results.
I retrieve all movie information from a DB and store it in Lucene Documents:
public Document createMovieDoc(Movie m){
document.add(new StoredField("moviename", m.getName()));
TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
field.setBoost(5.0f);
document.add(field);
}
And to search, I have this method:
public List searh(String txt){
PhraseQuery phQuery= new PhraseQuery();
Term term = new Term("movienameSearch", txt.toLowerCase());
BooleanQuery b = new BooleanQuery();
b.add(phQuery, Occur.SHOULD);
TopFieldDocs tp= searcher.search(b, 20, ..);
for(int i=0;i<tp.length;i++)
{
int mId = tp[i].doc;
Document d = searcher.doc(mId);
String moviename = d.get("moviename");
list.add(moviename);
}
return list;
}
I'm not sure what analyzer you are using to index. Sounds like maybe WhitespaceAnalyzer? It sounds like, when indexing, "1997_titanic" remains a single token, while "spider man" is split into the tokens "spider" and "man".
It could also be SimpleAnalyzer, which uses a LetterTokenizer. This would make it impossible to search for "1997", since that tokenizer eliminates all numbers from the indexed representation of the text.
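The two behaviours described above can be imitated with a pair of rough JavaScript functions. These are approximations written for illustration, not Lucene's tokenizers: whitespaceTokens splits on whitespace only, and letterTokens keeps runs of letters and lowercases them, roughly as SimpleAnalyzer does. The letter pattern assumes ASCII input.

```javascript
// Rough imitations for illustration; not Lucene's implementations.

// Splits on whitespace only, so "1997_titanic" stays one token.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Keeps runs of ASCII letters and lowercases them; digits and "_" are dropped.
function letterTokens(text) {
  return (text.match(/[A-Za-z]+/g) || []).map((t) => t.toLowerCase());
}

console.log(whitespaceTokens("1997_titanic")); // [ '1997_titanic' ]
console.log(whitespaceTokens("spider man"));   // [ 'spider', 'man' ]
console.log(letterTokens("1997_titanic"));     // [ 'titanic' ]
console.log(letterTokens("spider man"));       // [ 'spider', 'man' ]
```

In neither case does a token "1997" exist on its own, which would explain why a search for 1997 finds nothing.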
Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything. You create a Term in what you've provided, but nothing is ever done with that Term. Maybe this has something to do with how you've picked your excerpts? I'm not sure.
In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:
PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));
This requires you to know what the analyzer was doing at index time, and tokenize the input yourself to suit. The simpler solution is to just use the QueryParser:
// With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
// search(Query, int) returns TopDocs, not TopFieldDocs.
TopDocs tp = searcher.search(query, 20);
This allows you to rely on the same analyzer to index and query, so you don't have to know how to tokenize your phrases to suit.
As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It will tokenize those into discrete tokens, allowing them to be searched very easily, with a simple query like: movienameSearch:1997.

couchdb - Map Reduce - How to Join different documents and group results within a Reduce Function

I am struggling to implement a map / reduce function that joins two documents and sums the result with reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
"_id": "a124",
"_rev": "8-089da95f148b446bd3b33a3182de709f",
"detCat": "Life_Ausgehen",
"mainCat": "COL_LEBEN",
"mainBereich": "COL",
"type": "Cash",
"dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
"_id": "7568a6de86e5e7c6de0535d025069084",
"_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
"Value": 133.23,
"Sender": "Comtech",
"Booking Date": "11.02.2013",
"Detail": "Oki Drucker",
"newCat": "a124",
"dtTRA": true
}
Now I want to develop a map/reduce to get the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference another document by emitting its "_id" and querying with ?include_docs=true, but in that case I cannot use a reduce function.
I looked in other postings here, but couldn't find a suitable example.
Would be great if somebody has an idea how to solve this issue.
I understand that multiple Category documents may have the same mainCat value. The technique called view collation is suitable for some cases where a single join would be used in the relational model. In your case it will not help: although you use two document schemas, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by also storing the mainCat value in the transaction document, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could use, for example, "COL_LEBEN-7568a6de86e5e" (mainCat concatenated with some random value, where the "-" delimiter never occurs in mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
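A minimal sketch of that approach follows, assuming transaction IDs of the form "mainCat-random" and a Value field holding the amount. Inside a real view, emit is supplied by CouchDB and the reduce would be the built-in _sum; here emit is stubbed, together with a small loop that imitates querying with group_level=1, so that the logic can be run on its own. The second transaction is made up for illustration, and category documents are simply skipped in this sketch because they add nothing to the sum.

```javascript
// Map function as it might appear in a design document.
// Assumes transaction _ids look like "<mainCat>-<random>".
function map(doc) {
  if (doc.dtTRA) {
    const sep = doc._id.indexOf("-");
    emit([doc._id.slice(0, sep), doc._id.slice(sep + 1)], doc.Value);
  }
}

// --- Stand-in for CouchDB, for local experimentation only ---
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

const docs = [
  { _id: "COL_LEBEN-7568a6de86e5e", Value: 133.23, dtTRA: true },
  { _id: "COL_LEBEN-0a1b2c3d4e5f6", Value: 10, dtTRA: true }, // made-up sample
  { _id: "a124", mainCat: "COL_LEBEN", dtCAT: true },         // ignored by map
];
docs.forEach(map);

// Imitates group_level=1 with a sum reduce: group on the first key element.
const totals = {};
for (const { key, value } of rows) {
  totals[key[0]] = (totals[key[0]] || 0) + value;
}
console.log(totals);
```

Grouping on the first key element alone is what yields one row per main category, with the transaction values summed.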

CouchDB for Fixed Categories Queries

I have documents like this in my CouchDB:
{
"_id": "0cb35be3cc73d6859c303fa3200011d2",
"_rev": "1-f6e356bbf6ab09290aae11132af50d66",
"adresse": "Bohrgaß 10 /",
"plz": 56814,
"ort": "Faid /",
"kw": 2.32,
"traeger": "SOL"
...
}
There are predefined categories for certain attributes e.g. traeger: "SOL", "BIO", "WAS"; kw: <2, 2-5, 5-20, 20-100; plz: 56814, plz: 56815; ...
I have to be able to efficiently query the total number of docs for every category, as well as
the total number of docs and the docs themselves under certain conditions. E.g.:
How many docs are in the category kw <2 (and all other kw categories) under the condition traeger = "SOL"?
How many docs are in the category traeger = "SOL" (and all other traeger categories) under the conditions plz=56814 AND kw < 2?
The user can select which categories to combine. The categories are fixed. There will also be more attributes and categories.
What would the map/reduce functions for this look like?
Marcel
Since you are going to count documents, your reduce function is simply the built-in _count. Your map function needs to emit the appropriate keys your users are going to search for. Finally, when the view is queried, the appropriate group level has to be picked.
Example: You can create a view with a composite key ["traeger", "kw"]. If you query that view with group_level = 2, you get the number of documents for each combination of traeger and kw.
If you only care about the traeger "SOL", you can restrict the output with the start_key and end_key parameters.
If you want to know the number of documents in each "traeger" category regardless of their "kw", you can query that view with group_level = 1.
For your second example, you can create a view with the key ["plz","kw","traeger"], query it using start_key and end_key to restrict the results to plz=56814 AND kw < 2, and set group_level to 3.
Querying options for views are listed here:
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
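As a sketch of the first view, the map function below emits [traeger, kwCategory], with the built-in _count intended as the reduce. The kwCategory helper and its bucket labels are assumptions based on the ranges listed in the question (the boundary handling is a guess); emit and the grouping function are local stand-ins for what CouchDB does when the view is queried with group_level, and the sample documents are made up.

```javascript
// Hypothetical helper: maps a kw value to one of the fixed categories.
// Boundary handling (e.g. whether 2 falls in "2-5") is an assumption.
function kwCategory(kw) {
  if (kw < 2) return "<2";
  if (kw < 5) return "2-5";
  if (kw < 20) return "5-20";
  return "20-100";
}

// Map function for the view; the reduce would be the built-in "_count".
function map(doc) {
  emit([doc.traeger, kwCategory(doc.kw)], null);
}

// --- Local stand-ins for CouchDB ---
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Counts rows per key prefix, imitating _count with a given group_level.
function countByGroupLevel(level) {
  const counts = {};
  for (const { key } of rows) {
    const group = JSON.stringify(key.slice(0, level));
    counts[group] = (counts[group] || 0) + 1;
  }
  return counts;
}

// Made-up sample documents.
[
  { traeger: "SOL", kw: 2.32 },
  { traeger: "SOL", kw: 1.5 },
  { traeger: "BIO", kw: 1.0 },
].forEach(map);

console.log(countByGroupLevel(2)); // per traeger and kw category
console.log(countByGroupLevel(1)); // per traeger only
```

Emitting the category label rather than the raw kw value is what lets group_level produce one count per predefined bucket.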
