Solr indexing, split string field into list

In Solr, I want to index a string field as a list by splitting it.
Below is my indexing query in the data_config.xml file:
<document name="Example">
<entity dataSource="example_table" name="Example"
query="select id, text from example_table"
pk="id"
transformer="RegexTransformer"
>
<field column="id" name="id" />
<field column="text" name="text" />
</entity>
</document>
The text field is a comma-separated string, for example "A, B, C".
Below is the field definition in the schema.xml file:
<field name="text" type="string" indexed="true" stored="true" required="false" multiValued="true" />
When I query Solr, the output is:
"text":["A, B, C"]
Could someone explain how I can get the result below?
"text":["A","B","C"]

You can do it in your DataImportHandler definition, since you've already added the RegexTransformer (splitBy takes a regular expression):
<field column="text" name="text" splitBy=", " />
Or do it in your field definition by using a TextField with a PatternTokenizer. This only changes the indexed terms used for matching; the stored value returned in results stays "A, B, C":
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
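
To see what the two variants do to the sample value, here is a small Python sketch. It is an illustration only, not Solr code, and the function names are hypothetical. split_by mimics a regex split like the one splitBy performs, and tokenize_and_trim mimics splitting on "," and then trimming each piece.

```python
import re


def split_by(value, pattern):
    """Split a column value on a regular expression, like splitBy does."""
    return re.split(pattern, value)


def tokenize_and_trim(value, pattern=","):
    """Split on the pattern, then strip surrounding whitespace from each piece."""
    return [piece.strip() for piece in re.split(pattern, value)]


print(split_by("A, B, C", ", "))        # ['A', 'B', 'C']
print(split_by("A,B, C", ", "))         # ['A,B', 'C']
print(split_by("A,B, C", r"\s*,\s*"))   # ['A', 'B', 'C']
print(tokenize_and_trim("A,B, C"))      # ['A', 'B', 'C']
```

If the spacing after the commas is not consistent in your data, a pattern such as \s*,\s* for splitBy is likely more forgiving than ", ".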

Solr Better search result with adjacent query keyword

I have configured Solr for my ecommerce application (which mostly contains book data). The search results are not what I expect.
The configuration follows.
schema.xml
`
<fields>
<field name="namespace" type="string" indexed="true" stored="false" />
<field name="id" type="string" indexed="true" stored="true" />
<field name="productId" type="long" indexed="true" stored="true" />
<field name="skuId" type="long" indexed="true" stored="true" />
<field name="category" type="long" indexed="true" stored="false" multiValued="true" />
<field name="explicitCategory" type="long" indexed="true" stored="false" multiValued="true" />
<field name="searchable" type="text_general" indexed="true" stored="false" />
<dynamicField name="*_searchable" type="text_general" indexed="true" stored="false" />
<dynamicField name="*_i" type="int" indexed="true" stored="false" />
<dynamicField name="*_is" type="int" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_s" type="string" indexed="true" stored="false" />
<dynamicField name="*_ss" type="string" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_l" type="long" indexed="true" stored="false" />
<dynamicField name="*_ls" type="long" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_t" type="text_general" indexed="true" stored="false" />
<dynamicField name="*_txt" type="text_general" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_b" type="boolean" indexed="true" stored="false" />
<dynamicField name="*_bs" type="boolean" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_d" type="double" indexed="true" stored="false" />
<dynamicField name="*_ds" type="double" indexed="true" stored="false" multiValued="true" />
<dynamicField name="*_p" type="double" indexed="true" stored="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="false" />
<dynamicField name="*_dts" type="date" indexed="true" stored="false" multiValued="true" />
<!-- some trie-coded dynamic fields for faster range queries -->
<dynamicField name="*_ti" type="tint" indexed="true" stored="false" />
<dynamicField name="*_tl" type="tlong" indexed="true" stored="false" />
<dynamicField name="*_td" type="tdouble" indexed="true" stored="false" />
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="false" />
<!-- Both field types required for geolocation searches. First stores the
lat and lon components for the "coordinate" FieldType. Second stores
the coordinate. -->
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_c" type="coordinate" indexed="true" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" />
<!-- Default numeric field types. For faster range queries, consider the
tint/tlong/tdouble types. -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" />
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" />
<!-- Numeric field types that index each value at various levels of precision
to accelerate range queries when the number of values between the range endpoints
is large. See the javadoc for NumericRangeQuery for internal implementation
details. Smaller precisionStep values (specified in bits) will lead to more
tokens indexed per value, slightly larger index size, and faster range queries.
A precisionStep of 0 disables indexing at different precision levels. -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0" />
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0" />
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0" />
<!-- The format for this date field is of the form 1995-12-31T23:59:59Z,
and is a more restricted form of the canonical representation of dateTime
http://www.w3.org/TR/xmlschema-2/#dateTime The trailing "Z" designates UTC
time and is mandatory. Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
All other components are mandatory. Expressions can also be used to denote
calculations that should be performed relative to "NOW" to determine the
value, ie... NOW/HOUR ... Round to the start of the current hour NOW-1DAY
... Exactly 1 day prior to now NOW/DAY+6MONTHS+3DAYS ... 6 months and 3 days
in the future from the start of the current day Consult the DateField javadocs
for more information. Note: For faster range queries, consider the tdate
type -->
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0" />
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0" />
<!-- A general text field that has reasonable, generic cross-language defaults:
it tokenizes with StandardTokenizer and down cases. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
<fieldType name="coordinate" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>
`
solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>4.10.3</luceneMatchVersion>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}" />
<updateHandler class="solr.DirectUpdateHandler2" />
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0" />
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" />
<cache name="perSegFilter" class="solr.search.LRUCache" size="10" initialSize="0" autowarmCount="10"
regenerator="solr.NoOpRegenerator" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<listener event="newSearcher" class="solr.QuerySenderListener" />
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst>
<str name="q">static firstSearcher warming in solrconfig.xml</str>
</lst>
</arr>
</listener>
<useColdSearcher>false</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
</query>
<requestDispatcher handleSelect="false">
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" formdataUploadLimitInKB="2048"
addHttpRequestToContext="false"/>
<httpCaching never304="true" />
</requestDispatcher>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">name_t</str>
</lst>
</requestHandler>
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<str name="content-type">text/plain; charset=UTF-8</str>
</queryResponseWriter>
</config>
For example, when I search for 2 states, I get a lot of seemingly random results, many of which do not even contain "2 states" in the title.
However, when I search for it as a phrase, "2 States", I do get relevant results.
I don't want to wrap every search in quotes, since a user might search for a combination like "book by author", which would certainly return 0 results as a phrase because it won't match exactly.
How can I improve my search so that the most relevant results are listed at the top?

You can use the pf2 and pf3 parameters of the edismax query parser to boost documents where two (pf2) or three (pf3) of your terms appear next to each other in the field.
defType=edismax&pf2=title^4
There is also the pf parameter, which the regular dismax parser supports too, but it is built on the assumption that all the terms are close together. It might help, but pf2 or pf3 sounds better suited to what you need.
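
As a rough illustration of these parameters, the Python sketch below builds the query string for a request that uses edismax with pf2 and pf3. The helper name build_edismax_query, the field name name_t (taken from the df default in the question's solrconfig.xml), and the pf3 boost of 8 are assumptions for illustration; the answer itself only gives pf2=title^4.

```python
from urllib.parse import urlencode


def build_edismax_query(user_query, field, pf2_boost=4, pf3_boost=8):
    """Return a URL-encoded query string with adjacent-term (pf2/pf3) boosts."""
    params = {
        "q": user_query,                  # the raw user input, unquoted
        "defType": "edismax",             # switch to the edismax query parser
        "qf": field,                      # field(s) the terms are matched against
        "pf2": f"{field}^{pf2_boost}",    # boost docs where two terms are adjacent
        "pf3": f"{field}^{pf3_boost}",    # boost docs where three terms are adjacent
    }
    return urlencode(params)


# Hypothetical usage: append the result to .../select? on your own core.
print(build_edismax_query("2 states", "name_t"))
```

With this, a query such as book by author still matches documents that contain the terms anywhere, while documents that contain the terms next to each other are ranked higher.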

Solr won't search on fields belonging to nested entities

I am using Apache Solr to build search for a website.
I am using nested entities to import data from different tables. The data import is successful and all the documents are added to the index. My dataConfig looks like this:
<dataConfig>
<dataSource type="JdbcDataSource" driver ="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/purplle_purplle2" user="purplle_purplle" password="purplle123" />
<document name="doc">
<entity name="offer" query="SELECT * FROM service_offering WHERE module LIKE 'location' ">
<field column="name" name="name"/>
<field column="id" name="id" />
<field column="type_id" name="type_id" />
<entity name="offer_type" query=" select name from service_offeringtype where id='${offer.type_id}'" >
<field column="name" name="offer_type" />
</entity>
<entity name="offer_location" query=" select name from service_location where id='${offer.module_id}'" >
<field column="name" name="location_name" />
</entity>
<entity name="offer_address" query=" select * from service_address where module_id='${offer.module_id}' AND module LIKE 'location'" >
<entity name="loc_city" query=" select name from loc_city where id='${offer_address.city}'" >
<field column="name" name="loc_city" />
</entity>
<entity name="loc_area" query=" select name from loc_area where id='${offer_address.area}'" >
<field column="name" name="loc_area" />
</entity>
<entity name="loc_zone" query=" select name from loc_zone where id='${offer_address.zone}'" >
<field column="name" name="loc_zone" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Now, if I search this index directly, results are returned only for the "name" field. Searches on the other fields, namely "loc_area", "location_name", "loc_city", etc., return nothing.
My schema looks like this:
<field name="id" type="int" indexed="true" stored="true" />
<field name="name" type="string" indexed="true" stored="true" />
<field name="offer_type" type="string" indexed="true" stored="true" />
<field name="location_name" type="string" indexed="true" stored="true" />
<field name="type_id" type="string" indexed="true" stored="true" />
<field name="loc_city" type="string" indexed="true" stored="true" />
<field name="loc_area" type="string" indexed="true" stored="true" />
<field name="loc_zone" type="string" indexed="true" stored="true" />
However, if I copy these fields into the "text" field, which is present by default in schema.xml, then searching on "text" easily returns the relevant results.
<copyField source="name" dest="text"/>
<copyField source="offer_type" dest="text"/>
<copyField source="location_name" dest="text"/>
<copyField source="loc_city" dest="text"/>
<copyField source="loc_area" dest="text"/>
<copyField source="loc_zone" dest="text"/>
But I cannot do it like this, because I have to assign boost levels to the different fields for the score calculation. The moment I append "&defType=edismax&qf=name^1.0+location_name^10.0+loc_area^50.0" to the query, it returns no results.
What is wrong?

My guess is that the problem is the type of your fields. I don't know exactly what your fields contain, but there is a difference between type="string" and type="text".
The string type indexes the entire field input as a single untokenized value. A text type tokenizes and analyzes the field. For example, if I search for "john" against a string field containing "John Smith", I would not expect a hit, whereas against a text field I would get one.
Since your query works against a text field but not against the string fields, changing the types and reindexing is the likely solution.
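
The difference can be sketched in a few lines of Python. This is a deliberately simplified model, not Solr's actual analysis: the "text" analysis here is only a whitespace split plus lowercasing, and all function names are hypothetical.

```python
def string_field_terms(value):
    """A string field indexes the whole input as one untouched term."""
    return [value]


def text_field_terms(value):
    """Simplified text analysis: split on whitespace and lowercase."""
    return [token.lower() for token in value.split()]


def string_field_matches(query, value):
    # The query is not analyzed either, so it must equal the full value.
    return query in string_field_terms(value)


def text_field_matches(query, value):
    # The query goes through the same simplified analysis as the value.
    return any(term in text_field_terms(value) for term in text_field_terms(query))


print(string_field_matches("john", "John Smith"))        # False
print(string_field_matches("John Smith", "John Smith"))  # True
print(text_field_matches("john", "John Smith"))          # True
```

In schema terms, this would mean declaring fields such as location_name and loc_area with a tokenized type instead of string, assuming such a type is defined in your schema.xml.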

Querying numbers containing hyphens with SOLR WordDelimiterFilterFactory isn't working?

I'm trying to configure solr 4.0-BETA with a WordDelimiterFilterFactory so I can query numbers containing hyphens.
Field value: "123456-1234" when adding to ssn.
Queries:
"123456-1234" <- Works (with hyphen)
"1234561234" <- Doesn't work (without hyphen)
According to the documentation (AFAIUI) it should match since the fieldtype has generateNumberParts and catenateNumbers.
From the documentation:
generateNumberParts="1" causes number subwords to be generated: "500-42" => "500" "42"
catenateNumbers="1" causes maximum runs of number parts to be catenated: "500-42" => "50042"
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
My fields:
<fields>
<field name="ssn" type="text_en_splitting" indexed="true" stored="false" multiValued="false" />
<field name="ssn_exact" type="string" indexed="true" stored="true" multiValued="false" />
</fields>
<copyField source="ssn" dest="ssn_exact" />
<copyField source="ssn" dest="text" />
The filter in text_en_splitting:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
What am I missing here?

I created a similar field in my local schema and used the analysis tool in the Solr Admin UI (http://localhost:8983/solr/#/collection1/analysis). Note that this URL assumes Solr is running on http://localhost:8983/ and that your index is named collection1; modify as necessary.
I ran your value as the index input and your query as the query input, with text_en_splitting selected in the Analyse Fieldname / FieldType dropdown. The results show that the value 1234561234 is never added as an index term for this field type.
However, if you use the text_en_splitting_tight field type, you get the behavior you want: the hyphen is removed and 1234561234 is added to the index as a term. So I would switch the field type as follows, reindex, and you should be set.
<fields>
<field name="ssn" type="text_en_splitting_tight" indexed="true" stored="false" multiValued="false" />
<field name="ssn_exact" type="string" indexed="true" stored="true" multiValued="false" />
</fields>
<copyField source="ssn" dest="ssn_exact" />
<copyField source="ssn" dest="text" />
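
The same check can be scripted instead of clicked through. The Python sketch below only builds the request URL for Solr's field analysis request handler. It assumes the handler is registered at /analysis/field in solrconfig.xml (as in the example configuration), that Solr runs at http://localhost:8983/solr, and that the core is named collection1. The function name is hypothetical.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, index_value, query_value):
    """Build a URL that asks Solr how a field type analyzes a value and a query."""
    params = {
        "analysis.fieldtype": field_type,    # field type to run the analysis with
        "analysis.fieldvalue": index_value,  # text as it would be indexed
        "analysis.query": query_value,       # text as it would be queried
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


url = field_analysis_url(
    "http://localhost:8983/solr",
    "collection1",
    "text_en_splitting_tight",
    "123456-1234",
    "1234561234",
)
print(url)
```

Fetching that URL for both text_en_splitting and text_en_splitting_tight lets you compare the index-time and query-time tokens side by side.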

Solr does not search into integers?

I'm currently developing a search engine using Solr for an ecommerce website. I have these two fields in my schema.xml:
<field name="sku" type="string" indexed="true" stored="true" required="false" />
<field name="collection" type="string" indexed="true" stored="true" required="false" />
(The complete schema.xml is available below)
For information:
sku looks like this: 959620, 929345, 912365, ...
collection looks like this: Alcott, Spigrim, Tantal,...
They are indexed correctly. For instance, when I request:
http://localhost:8080/solr/myindex/select/?q=Alcott
I get all products with collection "Alcott".
But when I request:
http://localhost:8080/solr/myindex/select/?q=959620
I get nothing.
However, when I make the request more specific,
http://localhost:8080/solr/myindex/select/?q=sku:969520
I do get the product attached to this sku.
Is there any way to make "q=969520" work? And even better, to have "q=96" return all products whose sku starts with "96"?
Thank you for your help!
schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.2">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
<!-- A Trie based date field for faster date range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<fieldType name="pint" class="solr.IntField" omitNorms="true"/>
<fieldType name="plong" class="solr.LongField" omitNorms="true"/>
<fieldType name="pfloat" class="solr.FloatField" omitNorms="true"/>
<fieldType name="pdouble" class="solr.DoubleField" omitNorms="true"/>
<fieldType name="pdate" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<!-- A text field that only splits on whitespace for exact matching of words -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
</analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<!-- normalize accents, cedillas, the oe ligature, ... -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- split on whitespace -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- strip punctuation -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
<!-- drop empty tokens and oversized words -->
<filter class="solr.LengthFilterFactory" min="1" max="100" />
<!-- lowercase -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove elisions (l', qu', ...) -->
<filter class="solr.ElisionFilterFactory" articles="elisionwords_fr.txt"/>
<!-- split compound words -->
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
<!-- remove stop words -->
<filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_fr.txt" enablePositionIncrements="true"/>
<!-- synonym handling -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt" ignoreCase="true" expand="true"/>
<!-- partial words (edge n-grams) -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
<!-- stemming (plurals, ...) -->
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords_fr.txt"/>
<!-- remove any duplicate tokens -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="index">
<!-- normalize accents, cedillas, the oe ligature, ... -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- split on whitespace -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- strip punctuation -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
<!-- drop empty tokens and oversized words -->
<filter class="solr.LengthFilterFactory" min="1" max="100" />
<!-- lowercase -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- remove elisions (l', qu', ...) -->
<filter class="solr.ElisionFilterFactory" articles="elisionwords_fr.txt"/>
<!-- split compound words -->
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
<!-- remove stop words -->
<filter class="solr.StopFilterFactory" ignoreCase="1" words="stopwords_fr.txt" enablePositionIncrements="true"/>
<!-- synonym handling -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt" ignoreCase="true" expand="true"/>
<!-- partial words (edge n-grams) -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
<!-- stemming (plurals, ...) -->
<filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords_fr.txt"/>
<!-- remove any duplicate tokens -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- Less flexible matching, but less false matches. Probably not ideal for product names,
but may be good for SKUs. Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<!-- this filter can remove any duplicate tokens that appear at the same position - sometimes
possible with WordDelimiterFilter in conjunction with stemming. -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- A general unstemmed text field that indexes tokens normally and also
reversed (via ReversedWildcardFilterFactory), to enable more efficient
leading wildcard queries. -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<!-- KeywordTokenizer does no actual tokenizing, so the entire
input string is preserved as a single token
-->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- The LowerCase TokenFilter does what you expect, which can be
useful when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])" replacement="" replace="all"
/>
</analyzer>
</fieldType>
<fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldtype>
<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!--
The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f
Attributes of the DelimitedPayloadTokenFilterFactory :
"delimiter" - a one character delimiter. Default is | (pipe)
"encoder" - how to encode the following value into a payload
float -> org.apache.lucene.analysis.payloads.FloatEncoder,
integer -> o.a.l.a.p.IntegerEncoder
identity -> o.a.l.a.p.IdentityEncoder
Fully Qualified class name implementing PayloadEncoder, Encoder must have a no arg constructor.
-->
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
</analyzer>
</fieldtype>
<!-- lowercases the entire field value, keeping it as a single token. -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<!-- since fields of this type are by default not stored or indexed,
any data added to them will be ignored outright. -->
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
</types>
<fields>
<!-- Vu fields -->
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="string" indexed="true" stored="true" required="false" />
<field name="collection" type="string" indexed="true" stored="true" required="false" />
<field name="title" type="text_fr" required="false" />
<field name="description" type="text_fr" required="false" />
<field name="price" type="float" required="false" indexed="true" stored="false" />
<field name="brand_id" type="text" required="false" />
<field name="date_online" type="date" required="false" />
<field name="product_type" type="text" required="false" />
<field name="selection_id" type="sint" required="false" multiValued="true" indexed="true" stored="false" />
<field name="stock_delay" type="sint" required="false" />
<field name="stock" type="sint" required="false" />
<field name="price_type" type="sint" required="false" />
<field name="main_product_id" type="text" required="false" />
<field name="date_price" type="date" required="false" />
<!-- attributes -->
<dynamicField name="attr_*" type="sint" indexed="true" multiValued="true"/>
<field name="attr_13" type="int" indexed="true" multiValued="false"/>
<field name="attr_14" type="int" indexed="true" multiValued="false"/>
<field name="attr_19" type="int" indexed="true" multiValued="false"/>
<!-- This field will hold a copy of all the others, to make searching easier -->
<field name="global" type="text_fr" required="false" multiValued="true" />
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
compressed: [false] if this field should be stored using gzip compression
(this will only apply if the field type is compressable; among
the standard field types, only TextField and StrField are)
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<!--
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="textgen" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="textgen" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<field name="inStock" type="boolean" indexed="true" stored="true" />
-->
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them.
-->
<!--
<field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="comments" type="text" indexed="true" stored="true"/>
<field name="author" type="textgen" indexed="true" stored="true"/>
<field name="keywords" type="textgen" indexed="true" stored="true"/>
<field name="category" type="textgen" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
-->
<!-- catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema -->
<!-- <field name="text" type="text" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse for efficient
leading wildcard queries. -->
<!-- <field name="text_rev" type="text_rev" indexed="true" stored="false" multiValued="true"/> -->
<!-- non-tokenized version of manufacturer to make it easier to sort or group
results by manufacturer. copied from "manu" via copyField -->
<!-- <field name="manu_exact" type="string" indexed="true" stored="false"/> -->
<!-- <field name="payloads" type="payloads" indexed="true" stored="true"/> -->
<!-- Uncommenting the following will create a "timestamp" field using
a default value of "NOW" to indicate when each document was indexed.
-->
<!--
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
-->
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!--
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
-->
<!-- some trie-coded dynamic fields for faster range queries -->
<!--
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random" />
-->
<!-- uncomment the following to ignore any fields that don't already match an existing
field name or dynamic field, rather than reporting them as an error.
alternately, change the type="ignored" to some other type e.g. "text" if you want
unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->
</fields>
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>global</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
<!-- copyField commands copy one field to another at the time a document
is added to the index. It's used either to index the same field differently,
or to add multiple fields to the same field for easier/faster searching. -->
<copyField source="title" dest="global"/>
<copyField source="description" dest="global"/>
</schema>

Based on the behavior described it sounds like you're trying to use basic SearchHandler query syntax out of the box to search against multiple fields. That's not going to work out as you'd hope.
There are numerous options available:
Front-end the query so that fully qualified field names get sent (e.g. "fielda:foo OR fieldb:foo")
Copy the contents of searchable fields into a single search field (through copyField) and make that the default field to search
Use Solr Dismax syntax and specify multiple QueryFields (qf parameter in the request)
Since you have fields of different types, and want to apply wildcard matching and other such things, I'd recommend you go the Dismax route and look into creating a Query Handler that better suits your needs:
More info on:
The default SearchHandler: http://wiki.apache.org/solr/SearchHandler
Solr with Dismax: http://wiki.apache.org/solr/DisMaxQParserPlugin
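As a rough illustration of the Dismax option, here is a minimal Python sketch that builds a select URL with defType=dismax and a qf list. The base URL, the boost on sku, and the helper name build_dismax_url are assumptions made for the example; the field names come from the schema in the question.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port, and core path to your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_dismax_url(user_query, query_fields):
    """Build a select URL that searches several fields via the dismax parser."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # qf is a space-separated list of fields; "^" adds an optional boost.
        "qf": " ".join(query_fields),
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# The boost value 2 on sku is an arbitrary choice for the example.
url = build_dismax_url("95962", ["sku^2", "title", "description"])
print(url)
```

The same qf value could instead be set as a default on a request handler in solrconfig.xml, so that clients only need to send q.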

Yes, add a directive like this in your schema.xml after the field definitions:
<copyField source="sku" dest="text"/>
assuming that the defaultSearchField is set to text.
To search for all SKUs beginning with 96, you can search for 96*. Keep in mind, though, that this will match any field copied into the default field (not just SKUs) that begins with 96. To restrict it to SKUs, you will have to search for sku:96*.

You'll need a copyField setting for the fields you want to be searchable by default.
Since your defaultSearchField is set to global, try:
<copyField source="sku" dest="global"/>
You'll probably want to do the same for collection:
<copyField source="collection" dest="global"/>
To get partial matches (e.g. ?q=95) without special operators, you need to tweak the NGram filter. Your current setting, for both the index-time and the query-time analyzer, is:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6"/>
This means that partial matching will be available from 3 to 6 characters, for example:
959
9596
95962
596
...
If you want to allow it from 2 characters (e.g.: 95), change the minGramSize in both analyzers' filters and you should be good to go:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="6"/>
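To see what changing minGramSize does, here is a small Python sketch of edge n-gram generation. It is not Solr's implementation; it assumes grams are taken from the front of each token (which, as far as I know, is the filter's default side), and the SKU value is hypothetical.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of `token` with length between min_gram and max_gram."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


sku = "95962"  # hypothetical SKU value

grams_min3 = edge_ngrams(sku, 3, 6)
grams_min2 = edge_ngrams(sku, 2, 6)

print(grams_min3)  # ['959', '9596', '95962']
print(grams_min2)  # ['95', '959', '9596', '95962']
```

With a minimum of 3 there is no indexed gram for the two-character query 95, which is why lowering minGramSize to 2 makes that query match.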
Lastly, your global field probably shouldn't be stored (by default) but only indexed:
<field name="global" type="text_fr" indexed="true" stored="false" required="false" multiValued="true" />
Remember that you need to restart Solr and re-index for the changes to take effect.

Facet + Query on Text in order to get an autocomplete

I would like to add an auto-suggest / auto-complete field to my application. I am able to get that on a string field, but faceting or querying is not "working" on a text field the way it does on a string field, especially with values that contain spaces.
For now my request is q=cleared_keywords:piso\%20e*&facet=on&facet.field=cleared_keywords&facet.sort=result_count&facet.mincount=1&version=2.2&start=0&rows=0&indent=on&facet.limit=10
and my schema is :
<fields>
<field name="id" type="integer" indexed="true" stored="true" required="true"/>
<field name="country" type="string" indexed="true" stored="true" required="true"/>
<field name="city_id" type="integer" indexed="true" stored="true" required="false"/>
<field name="ad_type" type="integer" indexed="true" stored="true" required="true"/>
<field name="keywords" type="text" indexed="true" stored="true" required="true"/>
<field name="result_count" type="sint" indexed="true" stored="true" required="true"/>
<field name="hash" type="integer" indexed="true" stored="true" required="true"/>
<field name="cleared_keywords" type="string" indexed="true" stored="true" required="false"/>
<field name="keywords_score" type="sfloat" indexed="true" stored="true" required="true"/>
<field name="sorted_keywords" type="string" indexed="true" stored="true" required="true"/>
<field name="links_to" type="integer" indexed="true" stored="true" multiValued="true"/>
<field name="keywordsAsSuggestion" type="string" indexed="true" stored="true" />
<dynamicField name="random*" type="rand" indexed="true" stored="true"/>
<copyField source="keywords" dest="keywordsAsSuggestion" />
</fields>
If I try the same query on the text field (keywords), it does not work because of the text type.
I don't understand how copyField works. Do I need to reload or recreate the index?
I wanted to skip the "recreate index" step, but if I can't, I'll just load all the Solr documents and recreate new ones with a string field holding the values of the keywords text field. I just don't like that idea.
Regards,
Alexis

The analyzers and tokenizers defined for the field type text are different from those defined for string in the default schema.xml. If you want to suggest whole phrases, it would be better to define your own field type with the necessary analyzers and tokenizers. This gives detailed information about them.
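A toy Python sketch of why a prefix such as "piso e" behaves differently on the two field types. It assumes a string field indexes the whole value as one term and a text field at least lowercases and splits on whitespace; real text analyzers usually do more (stop words, stemming, and so on), and the keyword value is made up.

```python
def indexed_terms(value, tokenized):
    """Return the terms a field would index for `value`.

    tokenized=False mimics a string field (the whole value is one term);
    tokenized=True mimics a very simple text field (lowercase, whitespace split).
    """
    if not tokenized:
        return [value]
    return value.lower().split()


def terms_with_prefix(terms, prefix):
    """Return the indexed terms that start with `prefix`."""
    return [term for term in terms if term.startswith(prefix)]


keyword = "piso en madrid"  # hypothetical keywords value

string_hits = terms_with_prefix(indexed_terms(keyword, tokenized=False), "piso e")
text_hits = terms_with_prefix(indexed_terms(keyword, tokenized=True), "piso e")

print(string_hits)  # ['piso en madrid']
print(text_hits)    # []
```

On the tokenized field no single term contains the space, so a prefix spanning two words finds nothing, and facet values come back as individual words rather than whole phrases.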
