I'm currently developping a search engine using Solr for an ecommerce website. So I get these two fields in my schema.xml:
<field name="sku" type="string" indexed="true" stored="true" required="false" />
<field name="collection" type="string" indexed="true" stored="true" required="false" />
(The complete schema.xml is available below)
For information:
sku looks like this: 959620, 929345, 912365, ...
collection looks like this: Alcott, Spigrim, Tantal,...
They are well indexed. For instance, when I look for:
http://localhost:8080/solr/myindex/select/?q=Alcott
I got all products with collection "Alcott".
But when I look for;
http://localhost:8080/solr/myindex/select/?q=959620
I got nothing.
However, when I go deep forward with this request,
http://localhost:8080/solr/myindex/select/?q=sku:969520
I do have the product attached to this sku.
Is there any way to have "q=969520" working ? And even better: "q=96" resulting all products with sku starting by "96" ?
Thank you for your help !
schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.2">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="pint" class="solr.IntField" omitNorms="true"/>
<fieldType name="plong" class="solr.LongField" omitNorms="true"/>
<fieldType name="pfloat" class="solr.FloatField" omitNorms="true"/>
<fieldType name="pdouble" class="solr.DoubleField" omitNorms="true"/>
<fieldType name="pdate" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<!-- A text field that only splits on whitespace for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
</analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<!-- normalisation des accents, cédilles, e dans l'o,... -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- découpage selon les espaces -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- suppression de la ponctuation -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
<!-- suppression des tokens vides et des mots démesurés -->
<filter class="solr.LengthFilterFactory" min="1" max="100" />
<!-- passage en minuscules -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- suppression des élisions (l', qu',...) -->
<filter class="solr.ElisionFilterFactory" articles="elisionwords_fr.txt"/>
<!-- découpage des mots composés -->
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
<!-- suppression des mots insignifiants -->
<filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_fr.txt" enablePositionIncrements="true"/>
<!-- gestion des synonymes -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt" ignoreCase="true" expand="true"/>
<!-- partie de mot -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
<!-- lemmatisation (pluriels,...) -->
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords_fr.txt"/>
<!-- suppression des doublons éventuels -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="index">
<!-- normalisation des accents, cédilles, e dans l'o,... -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- découpage selon les espaces -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- suppression de la ponctuation -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
<!-- suppression des tokens vides et des mots démesurés -->
<filter class="solr.LengthFilterFactory" min="1" max="100" />
<!-- passage en minuscules -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- suppression des élisions (l', qu',...) -->
<filter class="solr.ElisionFilterFactory" articles="elisionwords_fr.txt"/>
<!-- découpage des mots composés -->
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
<!-- suppression des mots insignifiants -->
<filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_fr.txt" enablePositionIncrements="true"/>
<!-- gestion des synonymes -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt" ignoreCase="true" expand="true"/>
<!-- partie de mot -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
<!-- lemmatisation (pluriels,...) -->
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords_fr.txt"/>
<!-- suppression des doublons éventuels -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but less false matches. Probably not ideal for product names,
but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
possible with WordDelimiterFilter in conjuncton with stemming. -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- A general unstemmed text field that indexes tokens normally and also
reversed (via ReversedWildcardFilterFactory), to enable more efficient
leading wildcard queries. -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
<fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldtype>
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!--
The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f
Attributes of the DelimitedPayloadTokenFilterFactory :
"delimiter" - a one character delimiter. Default is | (pipe)
"encoder" - how to encode the following value into a playload
float -> org.apache.lucene.analysis.payloads.FloatEncoder,
integer -> o.a.l.a.p.IntegerEncoder
identity -> o.a.l.a.p.IdentityEncoder
Fully Qualified class name implementing PayloadEncoder, Encoder must have a no arg constructor.
-->
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
</analyzer>
</fieldtype>
<!-- lowercases the entire field value, keeping it as a single token. -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<!-- since fields of this type are by default not stored or indexed,
any data added to them will be ignored outright. -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
</types>
<fields>
<!-- Vu fields -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="string" indexed="true" stored="true" required="false" />
<field name="collection" type="string" indexed="true" stored="true" required="false" />
<field name="title" type="text_fr" required="false" />
<field name="description" type="text_fr" required="false" />
<field name="price" type="float" required="false" indexed="true" stored="false" />
<field name="brand_id" type="text" required="false" />
<field name="date_online" type="date" required="false" />
<field name="product_type" type="text" required="false" />
<field name="selection_id" type="sint" required="false" multiValued="true" indexed="true" stored="false" />
<field name="stock_delay" type="sint" required="false" />
<field name="stock" type="sint" required="false" />
<field name="price_type" type="sint" required="false" />
<field name="main_product_id" type="text" required="false" />
<field name="date_price" type="date" required="false" />
<!-- attributes -->
<dynamicField name="attr_*" type="sint" indexed="true" multiValued="true"/>
<field name="attr_13" type="int" indexed="true" multiValued="false"/>
<field name="attr_14" type="int" indexed="true" multiValued="false"/>
<field name="attr_19" type="int" indexed="true" multiValued="false"/>
<!-- Ce champ contiendra la copie de tous les autres, pour faciliter la recherche -->
<field name="global" type="text_fr" required="false" multiValued="true" />
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
compressed: [false] if this field should be stored using gzip compression
(this will only apply if the field type is compressable; among
the standard field types, only TextField and StrField are)
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<!--
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="textgen" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="textgen" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<field name="inStock" type="boolean" indexed="true" stored="true" />
-->
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them.
-->
<!--
<field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="comments" type="text" indexed="true" stored="true"/>
<field name="author" type="textgen" indexed="true" stored="true"/>
<field name="keywords" type="textgen" indexed="true" stored="true"/>
<field name="category" type="textgen" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
-->
<!-- catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema -->
<!-- <field name="text" type="text" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse for efficient
leading wildcard queries. -->
<!-- <field name="text_rev" type="text_rev" indexed="true" stored="false" multiValued="true"/> -->
<!-- non-tokenized version of manufacturer to make it easier to sort or group
results by manufacturer. copied from "manu" via copyField -->
<!-- <field name="manu_exact" type="string" indexed="true" stored="false"/> -->
<!-- <field name="payloads" type="payloads" indexed="true" stored="true"/> -->
<!-- Uncommenting the following will create a "timestamp" field using
a default value of "NOW" to indicate when each document was indexed.
-->
<!--
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
-->
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!--
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
-->
<!-- some trie-coded dynamic fields for faster range queries -->
<!--
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random" />
-->
<!-- uncomment the following to ignore any fields that don't already match an existing
field name or dynamic field, rather than reporting them as an error.
alternately, change the type="ignored" to some other type e.g. "text" if you want
unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->
</fields>
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>global</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
<!-- copyField commands copy one field to another at the time a document
is added to the index. It's used either to index the same field differently,
or to add multiple fields to the same field for easier/faster searching. -->
<copyField source="title" dest="global"/>
<copyField source="description" dest="global"/>
</schema>
Based on the behavior described it sounds like you're trying to use basic SearchHandler query syntax out of the box to search against multiple fields. That's not going to work out as you'd hope.
There are numerous options available:
Front-end the query so that fully-qualified field names get sent (eg "fielda:foo OR fieldb:foo")
Copy the contents of searchable fields into a single search field (through copyField) and make that the default field to search
Use Solr Dismax syntax and specify multiple QueryFields (qf parameter in the request)
Since you have fields of different types, and want to apply wildcard matching and other such things, I'd recommend you go the Dismax route and look into creating a Query Handler that better suits your needs:
More info on:
The default SearchHandler: http://wiki.apache.org/solr/SearchHandler
Solr with Dismax: http://wiki.apache.org/solr/DisMaxQParserPlugin
Yes add a directive like this in your schema.xml after the field definitions:
<copyField source="sku" dest="text">
assuming that the defaultSearchField is set to text.
To search for all SKUs beginning with 96 you can search for 96*. Keep in mind though this will return all fields (not just SKUs) that begin with 96. To restrict it to SKUs, you will have to search for sku:96*.
You'll need a copyField setting for the fields you want to be searchable by default.
Since your defaultSearchField is set to global, try:
<copyField source="sku" dest="global"/>
You'll probably want to do the same for collection:
<copyField source="collection" dest="global"/>
In order to have partial matches (e.g.: ?q=95) without special operators, you need to tweak the NGram filter. Your current setting, for both the index-time and the query-time analyzer is:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
This means that partial matching will be available from 3 to 6 characters, per example:
959
9596
95962
596
...
If you want to allow it from 2 characters (e.g.: 95), change the minGramSize in both analyzers' filters and you should be good to go:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="6"/>
Lastly, your global field probably shouldn't be stored (by default) but only indexed:
<field name="global" type="text_fr" indexed="true" stored="false" required="false" multiValued="true" />
Remember that you need to restart Solr and re-index for the changes to be in effect.
Related
Using the Solr admin interface under Schema, I am trying to create a catch-all Copy Field that searches all other fields.
When entering * as the source and search as the destination, the admin interface returns:
error processing commands
How can a catch-all Copy Field that searches all other fields be created?
In your schema.xml you can have this fields:
<field name="destination" type="text" indexed="true" stored="true" required="false"/>
<field name="country" type="text" indexed="false" stored="true" required="false" />
<field name="city" type="text" indexed="false" stored="true" required="false" />
<field name="state" type="text" indexed="false" stored="true" required="false" />
country, city and state are the source fields.
then can add the source field to destination as the following:
<copyField source="city" dest="destination"/>
<copyField source="state" dest="destination"/>
<copyField source="country" dest="destination"/>
Or you can also have something like as your source fields
<field name="destination" type="text" indexed="true" stored="true" required="false"/>
<field name="country" type="text" indexed="false" stored="true" required="false">
<field name="city" type="text" indexed="false" stored="true" required="false" />
<copyField source="*_y" dest="destination"/>
You can apply any suitable field type for your field destination
You can also add a field type text as below. This is an example for your reference. Which field type to use and what tokenizer to use and filters to use is all depends on your requirements.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
I'm using Solr to query a set of documents and I want to get the number of matches for certain term, right now I'm using
termfreq(text,'manage')
However this does not hit on Manager or Management
termfreq(text,'manage*')
returns the same count. I've tried using different tokenizers, some won't even accept the * and I haven't found one that returns the correct number of matches.
Field:
<field name="text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" required="false"/>
Is there a way I can get termfreq to also count partial matches?
You will need to add some custom tokenizers and and filter classes to the analyzer.
In your /shared/field_types.xml file, create a new type like this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And in /shared/fields.xml:
<field name="text" stored="true" type="text" multiValued="false" indexed="true"/>
<dynamicField name="*_text" stored="true" type="text" multiValued="false" indexed="true"/>
And use that as "text" as the type of the field.
A more advanced solution:
<fieldType name="startsWith" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove words/chars we don't care about -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<!-- now remove any extra space we have, since spaces WILL influence matching -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
In /shared/fields.xml:
<dynamicField name="*_starts_with" stored="true" type="startsWith" multiValued="false" indexed="true"/>
Then, in the top level of your core's schema.xml add this:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/fields.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../shared/field_types.xml"/>
And add this to your copyFields in the core's schema.xml:
<copyFields>
<copyField source="yourField" dest="yourField_text"/>
<copyField source="yourField" dest="yourField_starts_with"/>
...
</copyFields>
I have had the same problem. I needed to count the termfreq, which also should match on subparts of words.
Add this FieldType solved it.
<fieldType name="startWith" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I have a Pincode field like "389151" and I want to store this in below format.
pincode_analyzed: [
"389151",
"38915",
"3891",
"389"
]
You can copy the Pincode field to the pincode_analyzed field defined with a fieldType similar to this:
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<field name="pincode_analyzed" type="ngrams" indexed="true" stored="true" multiValued="true" />
<copyField source="Pincode" dest="pincode_analyzed"/>
You can read more about tokenizers here: https://cwiki.apache.org/confluence/display/solr/Tokenizers
Consider the following schema,
<schema>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" multiValued="false"/>
<fieldType name="stop_analyzer_string" class="solr.TextField" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="name_search" type="stop_analyzer_string" indexed="true" stored="false"/>
<copyField source="name" dest="name_search"/>
<field name="name" type="string" indexed="true" stored="true"/>
</fields>
</schema>
The name field gets indexed with WhitespaceTokenizerFactory, but it doesn't seem to use the WhitespaceTokenizerFactory while querying with the name field.
For a doc with name as "solr search",
the query name_search:solr - matches the document. //index time WhiteSpace tokenizer works
the query name_search:search - matches the document. //index time WhiteSpace tokenizer works
But the query name_search:solr search - doesn't match the document. //query time WhiteSpace tokenizer doesn't work
But as specified in the schema, the query should also be tokenized with whitespace and matched with the document. no?
Not sure what you are missing, but all the above queries worked for me for the data that you mentioned.
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=xml&indent=true
The above returned result document i indexed.
Just to test do this:
http://localhost:8983/solr/#/collection1/documents
Got to :
And paste below document as is into your Document(s) part and hit Submit Document
{"id":"100001","name_search":"solr search"}
Run you query as:
http://localhost:8983/solr/collection1/select?q=name_search%3Asolr+search&wt=json&indent=true
EDIT 3: The workaround I'm using right now is to strip anything but letters, digits, and whitespace from both my queries and my indexed fields. This produces the desired behavior, but it's very much a workaround rather than a true solution, and I would still like to understand why Solr is doing what it's doing...so still interested in an answer, if anyone has one. END EDIT 3
I have a document named "TT-14B" indexed by Solr 1.4 (via Django/Haystack). When I query the content_auto field for "tt-1" or "tt 14" or "tt 14b" I get the document back; when I query "tt-14" or "tt-14b" I get no results. I edited the Haystack-generated Solr schema a bit to try to fix this, to no avail. Using analyze.jsp, it seems to me that I should be getting a match for "tt-14"; I should certainly be getting one for "tt-14b". (Edit: Oh, and changing the default operator from AND to OR doesn't help.)
Can someone help me understand why this isn't working? Thanks.
...
results
QUERY | WORKS
=======|======
tt | yes
tt- | yes
tt-1 | yes
tt-14 | no
tt-14b | no
tt 14 | yes
tt 14b | yes
EDIT 2
Got some more comparably weird results, might help debug the problem. In this case the test document was "abc'def".
QUERY | WORKS
========|======
abc | yes
abc'd | yes
abc'de | no
abc'def | no
Same pattern, obviously, but I don't understand what's causing it.
END EDIT 2
schema.xml relevant part (full file below)
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
schema.xml (full)
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<schema name="default" version="1.1">
<types>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!-- Numeric field types that manipulate the value into
a string value that isn't human-readable in its internal form,
but with a lexicographic ordering the same as the numeric ordering,
so that range queries work correctly. -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" splitOnNumerics="0" preserveOriginal="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
</types>
<fields>
<!-- general -->
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" />
<field name="django_id" type="string" indexed="true" stored="true" multiValued="false" />
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<field name="modelname_exact" type="string" indexed="true" stored="true" multiValued="false" />
<field name="modelname" type="text" indexed="true" stored="true" multiValued="false" />
<field name="name" type="text" indexed="true" stored="true" multiValued="false" />
<field name="text" type="text" indexed="true" stored="true" multiValued="false" />
<field name="name_exact" type="string" indexed="true" stored="true" multiValued="false" />
<field name="content_auto" type="edge_ngram" indexed="true" stored="true" multiValued="true" />
</fields>
<!-- field to use to determine and enforce document uniqueness. -->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>text</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND" />
</schema>
An screenshout of the /admin/analysis.jsp for every case would be interesting.
Is there a reason, why positionIncrementGap="1"is set to 1?
tt-14b and tt 14b are handled different, because of the whitspace tokenizer.
That means: tt-14b is one term, as long as the WordDelimiterFilterFactorydoesn't fired, while tt 14b are 2 terms from the beginning.
The positionIncrementGap gives you the possibility to see different terms as one phrase, even if there are not neighbors, but on the next "n" position. So try to rise the positionIncrementGap.
Btw: The first i notice on your schema.xml are the missing "EdgeNGramFilterFactory" at query time. Which should be okay. But there are also understandable reasons, why "same filters on query- and index-time" are handled as best practices.
This depends on every special situation, but activating this filter on query-time would be a try.
A little late to the show on this one -- but, as noted above, the WhitespaceTokenizerFactory breaks words with hyphens if it's passed through the StandardAnalyzer. I found this out the hard way too...:
Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with Query Results
The solution is probably to use the KeywordAnalyzer -- it shouldn't split anything.
I came up with a similar work-around to your "EDIT 3" on the link above (in PHP).
The particularly frustrating thing about the Solr analyzer is that it shows everything is fine and behaving as expected -- which really confused me.
Good luck!