Indexing multi-lingual content with Lucene.net

I use Lucene.net to index content, documents, etc. on websites. The index is very simple and has this format:
LuceneId - unique id for Lucene (TypeId + ItemId)
TypeId - the type of text (e.g. page content, product, public doc, etc.)
ItemId - the web page id, document id, etc.
Text - the text indexed
Title - web page title, document name, etc. to display with the search results
I've got these options to adapt it to serve multi-lingual content:
1. Create a separate index for each language, e.g. Lucene-enGB, Lucene-frFR, etc.
2. Keep the one index and add an additional 'language' field to it to filter the results.
Which is the best option, or is there another? I've not used multiple indexes before, so I'm leaning toward the second.

I do [2], but one problem I have is that I cannot use different analyzers depending on the language. I've combined the stopwords of the languages I want, but I lose the more advanced features an analyzer offers, such as stemming.

You can eliminate options 1 and 2.
You can use one index and, for each field that may contain Arabic words, create two fields.
For example, if the field "Text" might contain Arabic or English content:
create two fields for it: one, "Text", indexed and searched with your standard analyzer, and another, "Text_AR", with the ArabicAnalyzer. To achieve that you can use
PerFieldAnalyzerWrapper
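
To make the per-field idea concrete, here is a small conceptual sketch in Python. It is not the Lucene.NET API: the class PerFieldAnalyzer and both toy analyzers are made up for illustration, and a real setup would plug in the actual standard and Arabic analyzers. It only shows the dispatch: each field name maps to its own analyzer, with a default for everything else.

```python
import re

def standard_analyzer(text):
    """Toy stand-in for a standard analyzer: lowercase, split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def whitespace_analyzer(text):
    """Toy stand-in for a language-specific analyzer: split on whitespace only."""
    return text.split()

class PerFieldAnalyzer:
    """Hypothetical helper: picks an analyzer by field name, else uses the default."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = dict(per_field)

    def analyze(self, field, text):
        return self.per_field.get(field, self.default)(text)

# "Text" falls back to the default analyzer; "Text_AR" gets its own.
analyzer = PerFieldAnalyzer(standard_analyzer, {"Text_AR": whitespace_analyzer})

def index_document(text):
    """Analyze the same source text once per field, as the answer suggests."""
    return {
        "Text": analyzer.analyze("Text", text),
        "Text_AR": analyzer.analyze("Text_AR", text),
    }
```

The same wrapper would be used at query time, so that a search against "Text_AR" is analyzed the same way the field was indexed.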

Related

The trouble with searching for a single-word sentence

I have a text field for tags. For example some entities:
{"tags": "apple. fruits. eat."}
{"tags": "green apple."}
{"tags": "banana. apple."}
I want to select entities with the tag apple, not green apple or something apple something. Most search variants boil down to the same thing: they select a sentence that contains the expression, no matter what the rest of the sentence looks like. But in this case it does matter.
How can I do this using Lucene syntax or Azure Search tools? Or, more generally, how can I search for an exact sentence?

I presume that the "." is a delimiter between the different tags. There may be a way to express this in Lucene, but you may need to add some custom analyzers to preserve the "."s during tokenization.
A better strategy in this case would be to use a field of type Collection(Edm.String). This will let you better preserve the structure of the tag phrases, and you can use a filter to select the specific value "apple". Collection(Edm.String) also allows you to enable faceting on the tags, which is useful.
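
As a rough sketch of that approach (in Python, with hypothetical helper names split_tags and tag_filter), the existing "."-delimited string could be split into a list before upload, and the exact-match filter expressed with the OData any() operator. This assumes the collection field is called tags and is marked filterable in the index.

```python
def split_tags(raw):
    """Turn 'apple. fruits. eat.' into ['apple', 'fruits', 'eat']."""
    return [t.strip() for t in raw.split(".") if t.strip()]

def tag_filter(field, value):
    """Build an OData filter that matches documents whose collection holds exactly `value`."""
    escaped = value.replace("'", "''")  # OData escapes a single quote by doubling it
    return f"{field}/any(t: t eq '{escaped}')"

docs = [
    {"tags": "apple. fruits. eat."},
    {"tags": "green apple."},
    {"tags": "banana. apple."},
]

# Shape each document would have once the field is a Collection(Edm.String)
converted = [{"tags": split_tags(d["tags"])} for d in docs]

# Local emulation of the filter: whole-value match, so "green apple" is excluded
matches = [d for d in converted if "apple" in d["tags"]]

filter_expression = tag_filter("tags", "apple")
```

The filter string would be passed as the $filter parameter of the search request; because it compares whole collection values, "green apple" does not match "apple".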

Implementing search : Identifying known keywords

I have implemented search functionality for my e-commerce website using Elasticsearch. The basic structure is that each product has a title, and whatever the user enters, I search for that exact string in Elasticsearch and return the result.
Now I notice that most of the search phrases (almost 90%) follow a similar pattern. They contain:
Brand name of the product (Apple, Nokia etc.)
Category of the product (phone, mobile phone, smartphone etc.)
Model name of the product (iPhone 6S, Lumia 950 etc.)
Now I think that if I can identify these specific components, I can return better results than with a plain text match.
I have lists of brands, categories and models. If I can identify which terms are present, I can query Elasticsearch on those fields specifically.
For example, given the search string "Apple iPhone 5S", I should be able to deduce that brand=Apple.
EDIT: More details as asked in comments
Structure of document:
I have a single index. Each document's ID is the SKU of the product, and each document contains the following fields:
title (Apple iPhone 5S)
brand (Apple)
categ (Electronics)
sub_categ (Smartphones)
model (iPhone 5S)
attribs (dictionary of product attributes particular to each sub_categ like {"color": "gold", "memory": "32 GB", "battery": "1570 mAh"})
price
Use Case:
Now when the user searches for the phrase "iphone 5s battery", Elasticsearch returns results that include even the phone itself. (I agree that the relevance score is higher for the battery.)
What I am trying to achieve is this: I have a master list of sub-categories. If any word from the search phrase is present in the master list, I would query Elasticsearch with ["must": {"sub_categ": "battery"}], so that results from the "Smartphones" sub-category would not be fetched. I wish to replicate this across multiple fields like brand, category, etc.
My question is: how do I quickly find whether a brand, or any other word from a master list, is present in the search phrase? The only option I could think of is looping through the master list and checking whether each word is present in the search phrase. If it is, I note it, do the same for every master list (brand, categ, sub_categ), and then generate and run the query with must clauses. I wish to know if there is a better way of accomplishing this.

The person in the Lucene world who has spoken the most on this topic is Ted Sullivan. (He calls this "auto-filtering", and has a component available for Solr which does this.)
I realize you're using Elasticsearch, but Ted's component works by introspecting FieldCache data (exposed by Lucene), so it should be possible to implement something very similar with Elasticsearch (look at the code).
There is also a discussion in this article about how to create a separate index for providing pre-query intelligence like you've described (e.g. your term "Apple" is most frequently found in the company field).
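
A minimal Python sketch of the pre-query detection step follows. It is not Ted Sullivan's component; it assumes the master lists are small enough to hold in memory. Instead of looping over every master-list entry per search, it builds one dictionary up front and looks up the word n-grams of the phrase, longest first. The names MASTER_LISTS, build_lookup, detect_fields and build_query are hypothetical, the sample values come from the question, and the term filters assume those fields are mapped as exact-value (keyword) fields.

```python
MASTER_LISTS = {
    "brand": ["Apple", "Nokia"],
    "sub_categ": ["Smartphones", "Battery"],
    "model": ["iPhone 5S", "iPhone 6S", "Lumia 950"],
}

def build_lookup(master_lists):
    """Map each lowercased known value to (field, canonical value)."""
    lookup = {}
    for field, values in master_lists.items():
        for value in values:
            lookup[value.lower()] = (field, value)
    return lookup

def detect_fields(phrase, lookup, max_words=3):
    """Scan the phrase left to right, preferring the longest known n-gram."""
    words = phrase.lower().split()
    found, leftover = {}, []
    i = 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lookup:
                field, value = lookup[candidate]
                found.setdefault(field, value)  # keep the first hit per field
                i += n
                break
        else:
            # No known value starts at this word; keep it for the text match
            leftover.append(words[i])
            i += 1
    return found, leftover

def build_query(phrase, lookup):
    """Detected values become term filters; the rest stays a title match."""
    found, leftover = detect_fields(phrase, lookup)
    bool_query = {"filter": [{"term": {f: v}} for f, v in found.items()]}
    if leftover:
        bool_query["must"] = [{"match": {"title": " ".join(leftover)}}]
    return {"query": {"bool": bool_query}}
```

Each lookup is a dictionary hit, so the cost depends on the length of the phrase rather than the size of the master lists. Which detected fields should become hard filters, and which should only boost, is a design decision this sketch does not settle.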

Solr search with ranking and best match

I am new to this forum and am looking for suggestions on one of our search requirements.
We have names, addresses and other relevant data to search. The search input is going to be a free-form text string of more than one word. The search API should match the input string against the complete data set, including names, addresses and other data. To do this, I have used copyField to copy all the required fields into a single search field in the Solr config, and I search that field (searchField) with the incoming input string. The input string can contain partial words, as in the example below.
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
After Solr creates the index, the searchable field will hold the document like below:
searchField {
Name: Test Insurance company
Address: 123 Main Avenue, Galaxy city
Phone: 6781230000
}
The end user can enter a search string like "Test Company Main Ave", and the search currently returns the above document, but not at the top; other documents are returned above it.
I am framing the Solr query as "Test* Company Main Ave", adding a "*" after the first word, and running it against searchField.
I adopted this approach after searching a few forums on the internet. How can I get the best match at the top? I am not sure the above approach is right.
Any help appreciated.
Thanks,
Ram

You could index all fields separately and also use your searchField as a catch-all.
Use an eDisMax search handler to query all fields with a scoring boost and also query your catch-all field.
e.g.
<str name="qf">
Name^2.0
Address^1.5
.
.
.
searchField^1.0
</str>
To boost relevancy, you could also index each field twice, once with a string type and then with a text_en type, as per this
<str name="qf">
Name^2.0
Name_exact^5.0
Address^1.5
Address_exact^3.0
.
.
.
searchField^1.0
</str>
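
If you would rather pass the boosts per request than fix them in the handler config, the same qf can be sent as a query parameter. Here is a minimal Python sketch that only builds the parameters; the function name edismax_params is hypothetical, only the fields named above are included (the elided ones are left out), and the host, core and handler path are up to your setup.

```python
from urllib.parse import urlencode

def edismax_params(user_query, boosts, debug=False):
    """Build request parameters for an eDisMax query with per-field boosts."""
    params = {
        "defType": "edismax",
        "q": user_query,
        # qf is a space-separated list of field^boost pairs
        "qf": " ".join(f"{field}^{boost}" for field, boost in boosts.items()),
    }
    if debug:
        # Ask Solr to include its scoring explanation in the response
        params["debugQuery"] = "true"
    return params

boosts = {
    "Name": 2.0,
    "Name_exact": 5.0,
    "Address": 1.5,
    "Address_exact": 3.0,
    "searchField": 1.0,
}

params = edismax_params("Test Company Main Ave", boosts, debug=True)
query_string = urlencode(params)  # append to the select handler URL
```

With debugQuery enabled, the response explains how each document's score was computed, which helps when tuning the boost values.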

Technically, if there are documents above the one you want to match, then they are a better match, so it depends on why they are getting a higher relevancy score. Try turning debugging on and see where the documents above your preferred document are getting the extra relevancy from.
Once you know why they are ranking higher, you need to ask yourself why your preferred document should come first: what makes it a "better" match in your eyes?
Once you've decided why it should come top, you need to work out how to index and search the content so that the documents you expect to come first actually do. As qux says in his answer, you may need to index multiple versions of the data to allow for better matching.
Si

Solr, managing entities

I have the following situation when using Solr. My documents contain "entities", for example "peanut butter". I have a list of such entities. These are terms that go together and are not to be treated as two individual words. During indexing, I want Solr to recognize this and treat "peanut butter" as a single entity. For example, if someone searches for
"peanut"
then documents that have the word peanut should rank higher than documents that have the phrase "peanut butter". However, if someone searches for
"peanut butter"
then the document that has peanut butter should show up higher than ones that have just peanut. Is there a config setting somewhere which can be modified such that the entity list can be specified in a file and Solr would do the needful?

Configure that field to use a StrField type instead of a TextField. TextField is designed to handle tokenization and full-text search on textual content. StrField treats its contents as a single keyword, and so does not tokenize.

How to find related items by tags in Lucene.NET

My indexed documents have a field containing a pipe-delimited set of ids:
a845497737704e8ab439dd410e7f1328|
0a2d7192f75148cca89b6df58fcf2e54|
204fce58c936434598f7bd7eccf11771
(ignore line breaks)
This field represents a list of tags. The list may contain 0 to n tag Ids.
When users of my site view a particular document, I want to display a list of related documents.
This list of related document must be determined by tags:
Only documents with at least one matching tag should appear in the "related documents" list.
Documents with the most matching tags should appear at the top of the "related documents" list.
I was thinking of using a WildcardQuery for this but queries starting with '*' are not allowed.
Any suggestions?

Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.
Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.
In some sort of pseudo code mockup, this would then become
for each ID in tags
    get all documents referencing this tag
    for each document found
        if master-list contains document; increase usage-count
        else; add document to master list
sort master-list by usage-count descending
Forgive me for not being more precise, but I am unable to mock up a fully working example right now.
You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html

Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.

You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.

Try this query on the tag field.
+(tag1 OR tag2 OR ... tagN)
where tag1, .. tagN are the tags of a document.
This query will return documents with at least one tag match. Scoring will automatically bring the documents with the highest number of matches to the top, as the final score is the sum of the individual scores.
Also, you need to realize that if you search for documents similar to the tags of Doc1, Doc1 itself will come out at the top of the search results. So, handle this case accordingly.
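
The pieces from these answers can be put together in a short Python sketch, with hypothetical helper names. It splits the pipe-delimited value into individual tags, builds the OR query against an assumed multi-valued field called tag, and emulates the intended ordering locally by counting shared tags while skipping the source document. Real Lucene scores also weigh term rarity, so the actual order can differ from a plain count.

```python
def split_tags(raw):
    """Split 'id1|id2|id3' into a list of tag ids, ignoring blanks."""
    return [t.strip() for t in raw.split("|") if t.strip()]

def related_query(tag_ids, field="tag"):
    """Build '+tag:(id1 OR id2 ...)', or None when the document has no tags."""
    if not tag_ids:
        return None
    return f"+{field}:({' OR '.join(tag_ids)})"

def rank_related(source_id, docs):
    """Order other documents by number of shared tags, dropping the source itself."""
    source_tags = set(docs[source_id])
    scored = []
    for doc_id, tags in docs.items():
        if doc_id == source_id:
            continue  # the source document always matches itself
        overlap = len(source_tags & set(tags))
        if overlap:
            scored.append((overlap, doc_id))
    # Most shared tags first; ties broken by id for a stable order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

docs = {
    "doc1": split_tags("a|b|c"),
    "doc2": split_tags("a"),
    "doc3": split_tags("b|c"),
    "doc4": split_tags("z"),
}

query = related_query(docs["doc1"])
related = rank_related("doc1", docs)
```

Here doc4 shares no tag with doc1 and is left out, and doc3 (two shared tags) ranks above doc2 (one shared tag).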
