Tune Solr phrase query search

We are trying to tune our phrase queries in DSE search.
For example, if a column named X holds the value "D A T A S T A X", we search for an exact match with X:"T A S T".
Words are tokenized with WhitespaceTokenizer.
We have a couple of hundred million records in the database, and all the indexes are in memory (we verified this with pcstat). Even so, the queries take 5-15 seconds. Why does it take so long to return results if all the indexes are in memory? How can I tune this?
Any help is appreciated.

Try this fieldType:
<fieldType name="custom_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^A-Za-z0-9])" replacement="" replace="all"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Here KeywordTokenizerFactory passes the text stream to the filters unchanged, as a single token. PatternReplaceFilterFactory removes everything except letters and digits; you can configure the pattern however you want. Then we lowercase the token and generate the n-grams. That is the index phase. In the query phase we skip the n-gram step because we want to match the exact substring.
We use NGram instead of EdgeNGram because NGram produces every substring. EdgeNGram only produces grams anchored at an edge of the token (the start or the end), so it does not help in this case.
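A minimal Python sketch of the two analyzer chains above may make the matching easier to follow. It is only a simulation, not Solr code: the function names are made up for illustration, and it assumes a document matches when the single query token equals one of the indexed grams.

```python
import re

def index_tokens(text, min_gram=2, max_gram=15):
    """Approximate the index analyzer: keep the whole value as one token,
    strip non-alphanumerics, lowercase, then emit every n-gram."""
    token = re.sub(r"[^A-Za-z0-9]", "", text).lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams

def query_token(text):
    """Approximate the query analyzer: same clean-up, but no n-grams."""
    return re.sub(r"[^A-Za-z0-9]", "", text).lower()

def matches(indexed_value, query):
    """Assumed matching rule: the query token must equal an indexed gram."""
    return query_token(query) in index_tokens(indexed_value)

print(matches("D A T A S T A X", "T A S T"))  # True
print(matches("D A T A S T A X", "S T A T"))  # False
```

Under this model, a query whose cleaned form is longer than maxGramSize (15 here) or shorter than minGramSize cannot match, because no gram of that length is indexed.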
Hope this helps.

Related

How to ignore stop words and use remaining words to fetch expected results alone using Solr?

I have a name field, with the following definition:
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
and I would like to search for contents based on the following criteria:
Example values:
test a value
test of solr
not a value
test me says value
So, if I search for test value, I should only get results containing both test and value. And even if I search for test a value, I should still get the same result, since stop words will be excluded. But with the current setup, even with edismax in the Solr query, I get all of the results: a record matches if it contains either of the two tokens. Could someone suggest what to change in the definition to get the expected result? And am I using stop words as intended? I do not want stop words in the search consuming execution time.
I updated the definition as per the suggestion, and even then the result does not make any sense to me.
I have a value what a term. There are other values like what the term; about a term; about the term; description test; Name a Nothing, etc. A search for what a term returns all of these. I also had values that were just a and the, and they were returned as well. Although the query omits the stop word for what a term, as the screenshot shows, the result does not make any sense to me.
You can ignore stop words at search and index time. You cannot remove them from the response: the response returns the text exactly as it was stored. The data used for searching and the data returned are different: search runs against the indexed data, while the response contains the stored data.
In your field type definition you are using KeywordTokenizerFactory.
KeywordTokenizerFactory: use this when you don't want to split your text into tokens. This tokenizer treats the entire text field as a single token.
Filters that work on individual words, such as StopFilterFactory, have no useful effect after it, because they only ever see the whole value as one token.
You can use StandardTokenizerFactory instead of KeywordTokenizerFactory.
StandardTokenizerFactory: this tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
On the analysis page, I analyzed the data for the above field.
Index the data "test a value" and query it with "test value": the result is found. While indexing, the stop words are skipped because we applied StopFilterFactory.
Now use "test a value" for both indexing and querying on the analysis page.
The stop word "a" is skipped because the filter is applied, and the result matches.
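The difference between the two tokenizers can be sketched in a few lines of Python. This is a rough stand-in for the analysis page, not Solr itself: the stop word set is an assumed excerpt of stopwords.txt, the regex only approximates StandardTokenizerFactory, and ASCII folding is left out.

```python
import re

STOP_WORDS = {"a", "of", "the"}  # assumed contents of stopwords.txt

def analyze_standard(text):
    """Rough stand-in for StandardTokenizer + StopFilter + LowerCaseFilter."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def analyze_keyword(text):
    """Rough stand-in for KeywordTokenizer + StopFilter + LowerCaseFilter:
    the whole value is one token, so only a value that is itself a stop
    word is removed."""
    return [] if text.lower() in STOP_WORDS else [text.lower()]

print(analyze_standard("test a value"))  # ['test', 'value']
print(analyze_standard("test value"))    # ['test', 'value']
print(analyze_keyword("test a value"))   # ['test a value']
```

With the standard-style chain, "test a value" and "test value" produce the same tokens, which is why they match on the analysis page. With the keyword-style chain the stop word stays inside the single token.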

Solr search does not return exact match

I am using Solr 6 to implement a search engine. The problem I am facing is that when I search for a term, other results are returned first and the exact match only appears at number 6.
For example, I am searching for Cafe 9.
It returns this:
NECOS NATURAL STORE & CAFE
SATTAR BUKSH CAFE
THE PINK CADILLAC CAFE & RESTAURANT
CAFE ROCK LAHORE
CAFE CHEF ZAKIR
CAFE 9
What I want is for Cafe 9 to be shown in first place, followed by the other results, since Cafe 9 is the exact match.
I have indexed all the fields with type text_general; the relevant part of schema.xml is below.
Thanks in advance.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.ApostropheFilterFactory"/>
<!-- <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/> -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want to boost the score of documents containing all the query terms in close proximity then you can pass the pf parameter with the value of the field name. In your case you should be passing pf=name (pf stands for phrase fields). The eDisMax query parser will attempt to make phrase queries out of all the terms in the q parameter, and if it’s able to find the exact phrase in any of the phrase fields, it will apply the specified boost to the match for that document.
In case you're not using the eDisMax query parser by default you can use it temporarily for the current query by passing q={!edismax pf=name}cafe 9.
You could also pass the pf2 parameter (as in pf2=name), which works like pf except that the generated phrase queries are the bigrams in your query (that is, every two consecutive terms are treated as a boosting phrase). There is also a pf3 parameter, if that happens to be what you're looking for.
You can also customize the boost and pass more than one field name to the phrase proximity parameters (for instance, pf=name^2 title^3).
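As a sketch, the request parameters could be assembled like this in Python. The host, port and core name (mycore) are placeholders, the helper name is made up, and using name as the qf field is an assumption; only defType, pf and the boost syntax come from the answer above.

```python
from urllib.parse import urlencode

def build_phrase_boost_url(base_url, user_query, field, boost=2):
    """Build an eDisMax select URL that boosts phrase matches on one field."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": field,               # assumed: search the same field we boost
        "pf": f"{field}^{boost}",  # phrase field with a boost, e.g. name^2
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical host and core name.
url = build_phrase_boost_url("http://localhost:8983/solr/mycore/select", "cafe 9", "name")
print(url)
```

urlencode takes care of escaping the space and the ^ character, which are easy to get wrong when typing the URL by hand.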

Solr: ClassicFilterFactory with acronyms & use of Solr's analyzer

I have a schema.xml with a text type that uses one set of tokenizers and filters at index time and another at query time. My problem is that a search query which should return some results doesn't return anything. So I thought using Solr's analyzer would bring me closer to the root of the problem.
I have the following string: Foo Bar Ges.m.b.H
This is my schema.xml definition for the field type text:
<fieldType name="text" class="solr.TextField" omitNorms="false" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="2" catenateAll="1" preserveOriginal="1" splitOnNumerics="0"/>
</analyzer>
</fieldType>
When I search for Foo Bar I get all the results back, so the problem lies within the Ges.m.b.H. (notice the missing dot at the end). I have a few questions about this:
1. ClassicFilterFactory
ClassicFilterFactory only works on acronyms that are in this format LETTER.LETTER.LETTER.. For example, G.m.b.H. -> GmbH. But it doesn't work on acronyms like G.m.b.H (missing dot at the end) or Ges.m.b.H. or Ges.m.b.H . Is there a way to get this to work? For now, I'm doing it with the WordDelimiterFilterFactory, but it would be good to know, if there is a better way.
2. Solr's Analyzer
I tried to analyze index and query time with Solr's analyzer. My text gets split at index and query time, as expected. When I fill out the fields for both index and query, I get highlighted fields that look as if there was a hit. Here are some screenshots:
The screenshot above is from index time of Foo Bar Ges.m.b.H, at LowerCaseFilterFactory. I also get "hits" at other filters, like my last filter, ReverseStringFilterFactory:
The next screenshot is from query time:
To me, it looks like Solr takes the last line of my query tokenizer/filter chain, searches for hits in the indexed documents, and highlights any hits it finds. Unfortunately, the same search doesn't return any hits when used in my normal search.
I drilled it down to exclude any other queries:
http://localhost:8982/solr/atalanda_development/select?q=foo+bar+ges.m.b.h&defType=edismax&qf=vendor_name_search_text
Summing up:
Any ideas, why this doesn't work?
Am I right, that the highlighted, kinda purple fields, are hits? Can someone explain, HOW Solr is doing this, so that I can understand this in the future?
Any suggestions to the ClassicFilterFactory problem would be great!
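The acronym behaviour described in point 1 can be modelled with a short Python sketch. The regex is an assumed approximation of the LETTER.LETTER.LETTER. shape from the question, not the tokenizer's actual grammar, and classic_filter is a made-up name.

```python
import re

# Assumed approximation of the acronym shape: single letters, each followed by a dot.
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")

def classic_filter(token):
    """Strip the dots only when the whole token has the acronym shape."""
    return token.replace(".", "") if ACRONYM.fullmatch(token) else token

for token in ["G.m.b.H.", "G.m.b.H", "Ges.m.b.H."]:
    print(token, "->", classic_filter(token))
```

In this model only G.m.b.H. is rewritten to GmbH; the other two forms pass through unchanged, which mirrors what the question reports.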

Search Solr: Match exact result Koh S*

I have a schema as below
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And my documents are:
Koh Samui
Koh Chang
Koh Lanta
When I search for Koh*, it returns 3 results, which is expected. But when I search for koh S*, it returns zero results. I want the result to be only Koh Samui.
I assume you want to use a wildcard query. Whether that works depends on the query parser you are using: the standard (Lucene) query parser and eDisMax support wildcard queries, while DisMax does not. The more likely cause here is the space. The parser splits Koh S* on whitespace into two clauses, Koh and S*, and neither matches, because KeywordTokenizerFactory indexes each value as a single token such as koh samui.
And to be honest, if you are using Solr mainly for wildcard queries, you might be using the wrong tool. You can use any SQL database with text like 'Koh S%' in the where clause, and this will give you what you want.
But anyway, to run the wildcard query in Solr, use the Lucene query parser explicitly and escape the space so that the whole pattern is treated as one term:
http://localhost:8983/solr/core/select?q={!lucene df=text}Koh\ S*
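A rough Python model of the behaviour in the question, assuming each value is indexed as one lowercased token and that an unescaped query is split on whitespace into OR-ed clauses. The function names are illustrative, and fnmatch stands in for Solr's wildcard matching.

```python
from fnmatch import fnmatchcase

# Each value is indexed as one lowercased token (KeywordTokenizer + LowerCaseFilter).
INDEXED_TERMS = [v.lower() for v in ["Koh Samui", "Koh Chang", "Koh Lanta"]]

def wildcard_hits(pattern):
    """Match one wildcard pattern against whole indexed terms."""
    return [t for t in INDEXED_TERMS if fnmatchcase(t, pattern.lower())]

def split_query_hits(query):
    """Model a parser that splits on whitespace and ORs the clauses."""
    hits = []
    for clause in query.split():
        for term in wildcard_hits(clause):
            if term not in hits:
                hits.append(term)
    return hits

print(wildcard_hits("Koh*"))       # ['koh samui', 'koh chang', 'koh lanta']
print(split_query_hits("Koh S*"))  # []
print(wildcard_hits("Koh S*"))     # ['koh samui']
```

In this model the pattern only finds Koh Samui when it reaches the index as a single term, for example when the space is escaped in the query.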

Solr for Arabic

I'm using Solr to index documents in 3 languages (Arabic, French and English). I have used this fieldType:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Everything was good, but in Arabic, when I search for a word like حقل, Solr doesn't find it. When I enter the word reversed, لقح, written from left to right, Solr finds the word and returns results.
Can I get results for Arabic words?
I'm going to turn Daniel's clever analysis here to an answer for the record. Don't vote for this, just go find something of his to vote for :-)
There are two ways to get a directionality mismatch with RTL text. You can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text. So the index was full of backwards Arabic. To fix this, he will have to find a library that extracts text from PDFs correctly.
Forcing Apache Tika to use the latest Apache PDFbox might help, or his PDF may be so quirky that even the latest PDFBox can't handle it. In which case he has a hard problem.
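The mismatch can be illustrated with a tiny Python sketch. It assumes the extractor stored the characters in plain reversed order, which is a simplification of what visual-order extraction does; to_visual_order is a made-up helper.

```python
def to_visual_order(logical_text):
    """Reverse the character sequence, as a naive extractor might store RTL text."""
    return logical_text[::-1]

query = "\u062d\u0642\u0644"          # the word as typed, in logical order
reversed_word = "\u0644\u0642\u062d"  # the same letters in reverse order

indexed = to_visual_order(query)      # what would end up in the index
print(indexed == query)               # False: the normal query finds nothing
print(indexed == reversed_word)       # True: only the reversed word matches
```

This is why the fix belongs in the text extraction step rather than in the Solr field type.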

Resources