Why did they create the concept of "schema.xml" in Solr? - search

Lucene does searching and indexing, all by taking "coding"... Why doesn't Solr do the same ? Why do we need a schema.xml ? Whats its importance ? Is there a way to avoid placing all the fields we want into a schema.xml ? ( I guess dynamic fields are the way to go, right ? )

That's just the way it was built. Lucene is a library, so you link your code against it. Solr, on the other hand, is a server, and in some cases you can just use it with very little coding (e.g. using DataImportHandler to index and Velocity plugin to browse and search).
The schema allows you to declaratively define how each field is analyzed and queried.
If you want a schema-less server based on Lucene, take a look at ElasticSearch.

If you want to avoid constantly tweaking your schema.xml, then dynamic fields are indeed the way to go. For an example, I like the Sunspot schema.xml — it uses dynamic fields to set up type-based naming conventions in field names.
https://github.com/outoftime/sunspot/blob/master/sunspot/solr/solr/conf/schema.xml
Based on this schema, a field named content_text would be parsed as a text field:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
Which corresponds to its earlier definition of the text fieldType.
Most schema.xml files that I work with start off based on the Sunspot schema. I have found that you can save a lot of time by establishing and reusing a good convention in your schema.xml.

Solr acts as a stand-alone search server and can be configured with no coding. You can think of it as a front-end for Lucene. The purspose of the schema.xml file is to define your index.
If possible, I would suggest defining all your fields in the schema file. This gives you greater control over how those fields are indexed and it will allow you to take advantage of copy fields (if you need them).

Related

Indexing dynamic fields in azure search

I have used solr search engine which has a feature of dynamic fields. For example , if we define product_* field in the schema.xml, it will accept all the fields starting with product_ during the indexing.
Is there a feature like this in azure search where we can just define a wildcard for a field and it can accept the related fields in the indexing? As the fixed field thing reduces flexibility and one has to define a new schema every time for adding new fields.
Azure Cognitive Search does not support dynamic fields. Adding fields to the schema as you detect them during indexing is the suggested workaround.
Please consider creating an item on our User Voice page for this. While we haven't considered adding support for dynamic fields specifically, we have been looking at making schemas more flexible and extensible, and your input could help us prioritize this.

Hybris: Use same field for search and facet

I have to use a field "manufacturerName" for both solr search and solr facet in Hybris. While the solr free text search requires the field type to be text, the facet only works properly in string type.
Is there any way to use this same field for both search and facet. I think there is one way by using "copyField" but I searched a lot, and still don't know how to use it?
Any help would be highly appreciated!
PS: On keeping the field type string, free text search doesn't fetch proper results. On keeping the field type text, facet shows truncated values.
Using a copyField instruction is the way to go, but that require you to define an alternative field - meaning you have one field with the type text and the associated tokenization, and one field of the type string which isn't processed in any way. There is no way in Solr to combine these in a single field that I know of.
You'll then use the name of the string field to generate the facets, while you use the other field when you're querying.
<copyField source="text_search_field" dest="string_facet_field" />
You'll then have to refer to the name string_facet_field when you're filtering or faceting on the field. You'll want to filter against the facet field after the user selects a facet, since you otherwise would end up with documents from other facets possibly leaking into your document result set (for example if the facet was "Foo Bar", you'd suddenly get documents that had "Baz Foo Bar Spam" as the facet, since both words are present in the search string.
I was not able to implement the "copyField" approach, but I found another easy way to do this. In solr.impex, I had already added my new field manufacturerNameFacet of type string, but there is a parameter "fieldValueProvider" and "valueProviderParameter". I provided these values as "springELValueProvider" and the field I wanted to use for search and facet "manufacturerName". After a solr full indexing, it worked like a charm. No other setting was required. The search and facet both were working as expected.

Lucene wildcard applied to indexed field

I have a set of indexed fields such as these:
submitted_form_2200FA17-AF7A-4E44-9749-79D3A391A1AF:true
submitted_form_2398389-2-32-43242423:true
submitted_form_54543-32SDf-3242340-32422:true
And I get that it's possible to wildcard queries such as
submitted_form_2398389-2-32-43242423:t*e
What I'm trying to do is get "any" submitted form via something like:
submitted_form_*:true
Is this possible? Or will I have to do a stream of "OR"s on the known forms (which seems quite heavy)
That's not the intended use of fields, I think. Field names aren't supposed to be the searchable values, field values are. Field names are supposed to be known a priori.
My suggestion is (if possible) to store the second part of the name as the field value, for instance: submitted_form:2398389-2-32-43242423. submitted_from would be the field known a priori, and the value could eventually be searched with a PrefixQuery.
Anyway, you could access the collection of fields' names using IndexReader.getFieldNames() in Lucene 3.x and this in Lucene 4.x. I wouldn't expect search performance there.

Can data in Solr be extended with manually defined meta data?

I have several documents in a solr collection that I want to be able to search through. Most of the data comes from web sites I can easily crawl, however, I need to add some attributes manually to because I have to add these attributes manually.
So as an example I get the following info from a site (all attributes returned from crawled site):
Name: Porsche Boxter
Year: 1996
...
I want to add additional fields through a web interface (info not present on crawled sites):
Cool: yes
foo: bar
My questions:
Does it make sense at all to store additional information along the indexed data within Solr (inside the documents) or would a best practice only have all crawled data in Solr and merge with an external managed database during query time? To me it makes more sense to have all my data that is eventually queried in Solr as some of the manually added attributed are required search criteria (e.g. look only for cool cars from the 90s).
Is it possible to use Solr to store additional information about indexed documents? I know the entire schema in advance, perhaps this is useful?
If I store my data exclusively in Solr, how can I ensure that during the next crawl the manually added data is not overwritten? Would partial update be required?
Since I am new to Solr it would also be very helpful if someone could simply manage what to look for in the documentation that describes my use case.
That depends on how often the external data changes. The more often, the less meaningful. Generally it is a good idea to store such data along the index data, because you get them without an additional database query.
Yes. Use indexed:falseand stored:true. If you knew not know all of such fields in advance you could use a dynamicField like <dynamicField name="*_stored" type="string" indexed="false" stored="true" />.
Yes. You have to use partial update. This is no problem in your case, because the fields not updated have stored:true.

Solr conditional adds/updates?

I have a fairly simple need to do a conditional update in Solr, which is easily accomplished in MySQL.
For example,
I have 100 documents with a unique field called <id>
I am POSTing 10 documents, some of which may be duplicate <id>s, in which case Solr would update the existing records with the same <id>s
I have a field called <dateCreated> and I would like to only update a <doc> if the new <dateCreated> is greated than the old <dateCreated> (this applies to duplicate <id>s only, of course)
How would I be able to accomplish such a thing?
The context is trying to combat race conditions resulting in multiple adds for the same ID but executing in the wrong order.
Thanks.
I can think of two ways:
Write your own UpdateHandler and override addDoc to implement that checking.
Put the appropriate locks (critical sections) in your client code in order to fetch the stored document, compare the dates, and conditionally add the new document in a thread-safe manner.
Remember that Solr is not a database, comparing it to MySQL is comparing apples and oranges.
As of solr 4.0, optimistic concurrency is enabled via the _version_ field.
http://yonik.com/solr/optimistic-concurrency/
To enable, you need to make sure your schema.xml contains
<field name="_version_" type="long" indexed="true" stored="true"/>
and in solrconfig.xml
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
With really custom addition logic like this, I find that writing your own client side updater works better. It keeps you from mucking around in Solr internals, which makes it easier to update in the future. You can definitly do this in SolrJ, but if you aren't a Java dev, there is probably a clientside library in your own preferred language... PHP, Python, Ruby, C# etc...
The rsolr Ruby gem (http://github.com/mwmitchell/rsolr/tree/master) makes it VERY easy to hack together a custom load script.

Resources