Solr query that contains slash - search

I've found an interesting query for Solr and it returns search results, but I don't understand the purpose of the slash symbol between the words.
duties:health/nurse
Does anybody know? Please help.

Simple. You can look at the analyzer chain to understand what happens.
My guess is that the analyzer chain turns the / into a space - which makes the query into
duties: health nurse
To find your analyzer chain in the configuration, start by checking the type of the field.
For example
<field name="health" type="text_general" indexed="true" stored="true" required="true"/>
Now we look for the definition of the type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As you can see, we have an index analyzer and a query analyzer.
The query analyzer uses StandardTokenizerFactory, which is what would turn the / in the query into something else: it treats / as a separator rather than keeping it inside a token.
From the solr wiki:
solr.StandardTokenizerFactory
A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types. There aren't any filters that use StandardTokenizer's types.

I am thinking that health/nurse is being viewed as a string literal as there are no spaces between. Health / nurse should yield different results than health/nurse, correct? If so, then health/nurse must be an indexed term in your documents.

Related

Solr search does not return exact match

I am using Solr 6 to implement a search engine. The problem I am facing is that when I search for a name, other results are returned first and the exact match only appears at number 6.
For example I am searching for Cafe 9
It returns me this...
NECOS NATURAL STORE & CAFE
SATTAR BUKSH CAFE
THE PINK CADILLAC CAFE & RESTAURANT
CAFE ROCK LAHORE
CAFE CHEF ZAKIR
CAFE 9
What I want is for Cafe 9 to be shown in first place, followed by the other results, since Cafe 9 is the exact match.
I have indexed all the fields with type text_general and the schema.xml is attached.
Thanks in advance.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.ApostropheFilterFactory"/>
<!-- <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/> -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want to boost the score of documents containing all the query terms in close proximity then you can pass the pf parameter with the value of the field name. In your case you should be passing pf=name (pf stands for phrase fields). The eDisMax query parser will attempt to make phrase queries out of all the terms in the q parameter, and if it’s able to find the exact phrase in any of the phrase fields, it will apply the specified boost to the match for that document.
In case you're not using the eDisMax query parser by default you can use it temporarily for the current query by passing q={!edismax pf=name}cafe 9.
You could also pass the pf2 parameter (as in pf2=name), which works in a way similar to pf except that the generated phrase queries are the bigrams in your query (that is, every two consecutive terms will be considered a boosting phrase). There is also a pf3 parameter if that happens to be what you're looking for.
You can also customize the boost and pass more than one field name to the phrase proximity parameters (for instance, pf=name^2 title^3).
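As a rough illustration, the request described above could be assembled like this in Python. The host, the core name mycore, and the use of name as the qf field are assumptions made for the sketch; adjust them to your own setup.

```python
from urllib.parse import urlencode

def build_phrase_boost_url(base_url, user_query, field="name", boost=2):
    """Build a select URL that uses eDisMax and boosts phrase matches."""
    params = {
        "defType": "edismax",      # use the eDisMax query parser
        "q": user_query,           # raw user input, e.g. "cafe 9"
        "qf": field,               # field to search (assumed to be "name")
        "pf": f"{field}^{boost}",  # boost docs where the whole query matches as a phrase
        "pf2": field,              # also boost matches on consecutive word pairs
    }
    return f"{base_url}/select?{urlencode(params)}"

# "mycore" is a hypothetical core name.
url = build_phrase_boost_url("http://localhost:8983/solr/mycore", "cafe 9")
print(url)
```

urlencode takes care of percent-encoding the space and the ^ in the boost, so the parameters can be written exactly as they appear in the answer.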

How to search arabic words in solr

In my solr schema.xml I defined product arabic name field as below
<field name="productNameArabic" type="text_ar" indexed="true" stored="true"/>
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
In Solr search I want to search by product name using Arabic letters. Arabic users may find it a little difficult to search for some product names, because certain characters have to be typed exactly.
Ex: إ أ آ
The characters above have to be typed with a Shift key combination. Arabic users will usually just type the plain “ا” character and should still get words such as the one below.
Ex: إبرا
I was able to achieve the desired functionality by adding ASCIIFoldingFilter. This filter removes accents from characters in different languages, so that variants of a character look the same at index time.
<fieldType name="arabic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
Some more information about this filter - here. Working code example - here
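To make the goal concrete, here is a toy Python model of the folding the question asks for: the three alef variants are mapped to the bare alef before matching. It is only a sketch of the desired behaviour, not how any Lucene filter is implemented, and the function name fold_alef is made up for the example. The Analysis screen in the Solr admin UI will show which filter in your chain actually performs this step.

```python
# Code points for the alef variants mentioned in the question.
ALEF = "\u0627"              # bare alef
ALEF_MADDA = "\u0622"        # alef with madda above
ALEF_HAMZA_ABOVE = "\u0623"  # alef with hamza above
ALEF_HAMZA_BELOW = "\u0625"  # alef with hamza below

# Translation table that folds every variant to the bare alef.
_FOLD = str.maketrans({
    ALEF_MADDA: ALEF,
    ALEF_HAMZA_ABOVE: ALEF,
    ALEF_HAMZA_BELOW: ALEF,
})

def fold_alef(text: str) -> str:
    """Return text with all alef variants replaced by the bare alef."""
    return text.translate(_FOLD)

indexed = "\u0625\u0628\u0631\u0627"  # the example word from the question
typed = "\u0627\u0628\u0631\u0627"    # the same word typed with a plain alef
print(fold_alef(indexed) == fold_alef(typed))
```

If both the indexed text and the query go through the same folding, the two spellings end up as the same term.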

How to ignore whitespaces on solr query

I have the name Audioslave indexed on Solr and I want to match that document to the query string Audio Slave.
I have the following rule configured:
<fieldType name="text_filter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
generateWordParts="1"
generateNumberParts="1"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
generateWordParts="1"
generateNumberParts="1"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
And a field using it:
<field name="artist_name_filter" type="text_filter" multiValued="false" indexed="true" stored="true" required="false" />
When using Solr analysis tool everything looks good.
The Query part is the following:
The KeywordTokenizerFactory generates Audio Slave,
Then the WordDelimiterFilterFactory splits it into Audio Slave, Audio, AudioSlave and Slave (let's just follow the 3rd column, AudioSlave, from here).
The TrimFilterFactory keeps it as AudioSlave.
Finally the LowerCaseFilterFactory changes it to audioslave.
On the other hand, the index part is:
The KeywordTokenizerFactory generates Audioslave,
Then the WordDelimiterFilterFactory and TrimFilterFactory keep it as Audioslave.
Finally the LowerCaseFilterFactory changes it to audioslave.
So both fields should match, but the query returns no results:
http://localhost:8983/solr/search_api/select?defType=edismax&fq=type:Artist&q=Audio%20slave&qf=artist_name_filter&wt=json
Your problem isn't analysis, it's QueryParser syntax. Spaces are used to separate query clauses, and that isn't affected by the analyzer. When you have q=Audio slave, it applies query syntax rules first, and separates it into clauses "Audio" and "slave", and then analyzes each clause separately.
Escaping the space should do the job, I believe: q=Audio\ slave
A phrase query here seems like it ought to work, such as q="Audio slave", but it doesn't. It generates something like: "(audio slave audio audioslave) slave" for me, which is problematic.
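A small Python sketch of the escaped-space variant, reusing the URL from the question. The helper name escape_spaces is made up for the example, and whether the escaped query then matches still depends on your parser and analyzer, so treat it as something to try rather than a guaranteed fix.

```python
from urllib.parse import urlencode

def escape_spaces(term: str) -> str:
    """Backslash-escape spaces so the query parser sees a single clause."""
    return term.replace(" ", "\\ ")

params = {
    "defType": "edismax",
    "fq": "type:Artist",
    "q": escape_spaces("Audio slave"),  # becomes: Audio\ slave
    "qf": "artist_name_filter",
    "wt": "json",
}
# urlencode percent-encodes the backslash (%5C) and the space (+).
url = "http://localhost:8983/solr/search_api/select?" + urlencode(params)
print(url)
```

Letting urlencode handle the encoding avoids the easy mistake of sending a raw backslash or space in the URL.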
Try using the WhitespaceTokenizerFactory as the tokenizer for your index part.
Here the KeywordTokenizerFactory keeps the text as it is: it emits the whole input as a single token and does not split it.
Replace it with WhitespaceTokenizerFactory, which creates tokens at each space.

Search in solr with special characters

I have a problem with a search with special characters in solr.
My document has a field "title" and sometimes it can be like "Titanic - 1999" (it has the character "-").
When I try to search in Solr with "-", I receive a 400 error. I've tried to escape the character, so I tried something like "-" and "\-". With those changes Solr no longer responds with an error, but it returns 0 results.
How can I search in the Solr admin with a special character (something like "-" or "'")?
Regards
UPDATE
Here you can see my current Solr schema: https://gist.github.com/cpalomaresbazuca/6269375
My search is to the field "Title".
excerpt from the schema.xml:
...
<!-- A general text field that has reasonable, generic
cross-language defaults: it tokenizes with StandardTokenizer,
removes stop words from case-insensitive "stopwords.txt"
(empty by default), and down cases. At query time only, it
also applies synonyms. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>
You are using the standard text_general field for the title attribute. This might not be a good choice. text_general is meant to be for huge chunks of text (or at least sentences) and not so much for exact matching of names or titles.
The problem here is that text_general uses the StandardTokenizerFactory.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
StandardTokenizerFactory does the following:
A good general purpose tokenizer that strips many extraneous
characters and sets token types to meaningful values. Token types are
only useful for subsequent token filters that are type-aware of the
same token types.
This means the '-' character is used as a split point when tokenizing the string and is then discarded.
"kong-fu" will be represented as "kong" and "fu". The '-' disappears.
This also explains why select?q=title:\- won't work here.
Choose a better fitting field type:
Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, that only splits on whitespace for exact matching of words. So making your own field type for the title attribute would be a solution.
Solr also has a fieldtype called text_ws. Depending on your requirements this might be enough.
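The difference between the two tokenizers can be sketched in a few lines of Python. This is only a rough approximation: the regular expression stands in for StandardTokenizer and ignores most of its real rules, and both function names are made up for the example.

```python
import re

def standard_like_tokens(text):
    """Rough stand-in for StandardTokenizer: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", text)

def whitespace_tokens(text):
    """Stand-in for WhitespaceTokenizer: split on whitespace only."""
    return text.split()

for sample in ("Titanic - 1999", "kong-fu"):
    print(sample, standard_like_tokens(sample), whitespace_tokens(sample))
```

With the whitespace approach the '-' survives, either as its own token or inside "kong-fu", so a query containing it has something in the index to match.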
To search for your exact phrase put inverted commas round it:
select?q=title:"Titanic - 1999"
If you just want to search for that special character then you will need to escape it:
select?q=title:\-
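If the search text comes from user input, the escaping can be done in code before the query is sent. The helper below is a minimal sketch loosely modelled on what SolrJ's ClientUtils.escapeQueryChars does; the character list is my own assumption of the commonly documented special characters, so check it against the query parser documentation for your Solr version.

```python
# Characters assumed to have special meaning in the Lucene/Solr query syntax.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_query_chars(text):
    """Prefix every special character and whitespace with a backslash."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )

print(escape_query_chars("-"))
print(escape_query_chars("Titanic - 1999"))
```

Escaping only prevents the syntax error; whether the character can be found still depends on how the field was analyzed at index time.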
Also check:
Special characters (-&+, etc) not working in SOLR Query
If you know exactly which special characters you don't want to use, then you can add this to regex-normalize.xml:
<regex>
<pattern>-</pattern>
<substitution>%2D</substitution>
</regex>
This will replace every "-" with %2D, so as long as you search for %2D instead of "-", it will work fine.
I spent a lot of time getting this done. Here are the steps needed to query special characters in Solr. Hope it helps someone.
Edit the schema.xml file and find the solr.TextField that you are using.
Under both the "index" and "query" analyzers, modify the WordDelimiterFilterFactory and add types="characters.txt". Something like:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
</analyzer>
</fieldType>
Ensure that you use WhitespaceTokenizerFactory as the tokenizer, as shown above.
Your characters.txt file can have entries like:
\# => ALPHA
# => ALPHA
\u0023 => ALPHA
That is, each entry maps the character to ALPHA only.
Clear the data, re-index, and query for the entered characters. It will work.

Anyone has the best way for synonym search of multi keyword in solr?

I want to use synonym search in Solr for multi-word keywords.
But it doesn't work correctly.
I set "multi term" as a synonym for "multerm" in synonym.txt. I expected Solr to build a phrase query for "multerm", such as field:"multi term"~0, but it builds field:multi | field:term instead. So it can't do a proximity search for the multi-term synonym.
Does anyone know the best way to do multi-term synonym search in Solr? Please help.
Here is how I handle multi-word synonyms. In my schema.xml, fieldType definition looks like:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizer="solr.KeywordTokenizerFactory"/>
<fieldType name="custom_text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- We will use synonyms only at index time to keep querying fast-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizer="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<!-- We will use synonyms only at index time to keep querying fast
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
Couple of things to note:
I am using synonyms only at index time, to keep queries fast.
I added tokenizer="solr.KeywordTokenizerFactory" to the synonym filter so that each synonym entry is treated as a single token and multi-word synonyms are not split.
I added expand="true". If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.
Query time synonyms are commented out.
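The effect of the expand flag described above can be modelled in a few lines of Python. This is a toy model of one comma-separated synonym group, not Solr's parser; the function name and the sample group are made up for the illustration.

```python
def build_synonym_map(group, expand=True):
    """Map each term of one comma-separated synonym group to its replacement(s)."""
    terms = [t.strip() for t in group.split(",")]
    if expand:
        # Every term expands to all equivalent terms in the group.
        return {t: terms for t in terms}
    # Every term is reduced to the first term in the list.
    return {t: [terms[0]] for t in terms}

group = "multerm, multi term"
print(build_synonym_map(group, expand=True))
print(build_synonym_map(group, expand=False))
```

Note that "multi term" stays a single entry here because the group is split on commas only, which mirrors the intent of not splitting multi-word synonyms.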

Resources