Search / indexing on small binary fields in SOLR

I need to index small binary strings with Solr, but have failed to do so. Specifically, I'm trying to search on hashes like SHA-1 and MD5, and on things like UUIDs.
I have a binary field intended to be indexed:
<field name="fi" type="binary" indexed="true" stored="true" required="false" multiValued="false" />
And this binary type definition:
<fieldtype name="binary" class="solr.BinaryField"/>
Why does any attempt to select on this field, even with a fi:* query, find nothing? Is there an alternative to my approach?

If your data is just SHA-1 hashes and the like, I think you can make this work perfectly well with a StrField. If you need prefix searches, be sure to analyze the field properly with solr.EdgeNGramTokenizerFactory.
Regarding the binary field you are using: I have never had to use it myself, but apparently it Base64-encodes whatever you send and then indexes that, so you can really send binary data (like an .exe file).
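Note that an analyzer only applies to a solr.TextField-based type, not to StrField, so a prefix-searchable hash field would need a definition roughly along these lines (the type name and gram sizes here are illustrative, not from the original question):
<fieldType name="hash_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="3" maxGramSize="40"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="fi" type="hash_prefix" indexed="true" stored="true"/>
With this, a query like fi:1a79a4 would match any document whose hash starts with that prefix, and a full 40-character hash still matches exactly.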

Related

Querying on StrFields with large values returns no documents in Solr 4.4

I'm having trouble querying on a StrField with a large value (say 70k characters). I'm using Solr 4.4 and have documents with a string type:
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
and field:
<dynamicField name="someFieldName_*" type="string" indexed="true" stored="true" />
Note that it's stored, in case that matters.
Across my documents, the length of the value in this StrField can be up to ~70k characters or more.
The query I'm trying is someFieldName_1:*. If someFieldName_1 has values with length < 32,767 characters, then it works fine and I get back various documents with values in that field.
However, if I query someFieldName_2:* and someFieldName_2 has values with length >= 32,767, I don't get back any documents. Even though I know that many documents have a value in someFieldName_2.
I know this because I query *:* and see documents with (large) values in someFieldName_2.
So is there some type of limit to the length of strings in a StrField that I can query against? 32,767 = 2^15 - 1 is mighty suspicious =)
Yonik answered this question on the Solr user mailing list with, "I believe that's the maximum size of an indexed token...". So it seems like the behavior is somewhat expected.
However, another user opened a bug report about the lack of errors: "i'll open a bug to figure out why we aren't generating an error for this at index time, but the behavior at query time looks correct..."

solr index for multi-valued multi-type field

I am indexing a collection of XML documents with the following structure:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experiences>
<experience years="10" type="Java"/>
<experience years="4" type="Hadoop"/>
<experience years="1" type="Hbase"/>
</experiences>
</mydoc>
Is there any way to create a Solr index so that it supports the following query:
find all docs with experience type "Hadoop" and years>=3
So far my best idea is to put a delimited years||type value into a multiValued string field, search for all docs with type "Hadoop", and then iterate through the results to select years>=3. Obviously this is very inefficient for a large set of docs.
I think there is no obvious solution for indexing data that comes from a many-to-many relationship. In this case I would go with dynamic fields: http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
Field definition in schema.xml:
<dynamicField name="experience_*" type="integer" indexed="true" stored="true"/>
So, using your example you would end up with something like this:
<mydoc>
<id>1234</id>
<name>Some Name</name>
<experience_Java>10</experience_Java>
<experience_Hadoop>4</experience_Hadoop>
<experience_Hbase>1</experience_Hbase>
</mydoc>
Then you can use the following filter query: fq=experience_Hadoop:[3 TO *]

Why did they create the concept of "schema.xml" in Solr?

Lucene does searching and indexing, all through code... Why doesn't Solr do the same? Why do we need a schema.xml? What is its importance? Is there a way to avoid placing all the fields we want into schema.xml? (I guess dynamic fields are the way to go, right?)
That's just the way it was built. Lucene is a library, so you link your code against it. Solr, on the other hand, is a server, and in some cases you can just use it with very little coding (e.g. using DataImportHandler to index and Velocity plugin to browse and search).
The schema allows you to declaratively define how each field is analyzed and queried.
If you want a schema-less server based on Lucene, take a look at ElasticSearch.
If you want to avoid constantly tweaking your schema.xml, then dynamic fields are indeed the way to go. For an example, I like the Sunspot schema.xml — it uses dynamic fields to set up type-based naming conventions in field names.
https://github.com/outoftime/sunspot/blob/master/sunspot/solr/solr/conf/schema.xml
Based on this schema, a field named content_text would be parsed as a text field:
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
Which corresponds to its earlier definition of the text fieldType.
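The text fieldType it refers to is an analyzed type along these lines (a simplified sketch of a typical definition, not the exact Sunspot schema):
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>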
Most schema.xml files that I work with start off based on the Sunspot schema. I have found that you can save a lot of time by establishing and reusing a good convention in your schema.xml.
Solr acts as a stand-alone search server and can be configured with no coding. You can think of it as a front-end for Lucene. The purpose of the schema.xml file is to define your index.
If possible, I would suggest defining all your fields in the schema file. This gives you greater control over how those fields are indexed and it will allow you to take advantage of copy fields (if you need them).

Solr conditional adds/updates?

I have a fairly simple need to do a conditional update in Solr, which is easily accomplished in MySQL.
For example,
I have 100 documents with a unique field called <id>
I am POSTing 10 documents, some of which may be duplicate <id>s, in which case Solr would update the existing records with the same <id>s
I have a field called <dateCreated> and I would like to only update a <doc> if the new <dateCreated> is greater than the old <dateCreated> (this applies to duplicate <id>s only, of course)
How would I be able to accomplish such a thing?
The context is combating race conditions that result in multiple adds for the same ID executing in the wrong order.
Thanks.
I can think of two ways:
Write your own UpdateHandler and override addDoc to implement that checking.
Put the appropriate locks (critical sections) in your client code in order to fetch the stored document, compare the dates, and conditionally add the new document in a thread-safe manner.
Remember that Solr is not a database; comparing it to MySQL is comparing apples to oranges.
As of Solr 4.0, optimistic concurrency is enabled via the _version_ field.
http://yonik.com/solr/optimistic-concurrency/
To enable, you need to make sure your schema.xml contains
<field name="_version_" type="long" indexed="true" stored="true"/>
and in solrconfig.xml
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
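Once both pieces are in place, an update that carries the document's current _version_ value will be rejected with an HTTP 409 conflict if the document has changed since it was read, so the client can re-read it, compare dateCreated, and retry. A rough sketch of such an update body, POSTed to the /update handler as JSON (the field names and version number are purely illustrative):
[
  {"id": "1234", "dateCreated": "2013-05-01T12:00:00Z", "_version_": 1436363254151544832}
]
There are also special values: _version_=1 requires that the document already exists, a negative _version_ requires that it does not exist yet, and _version_=0 applies no check at all.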
With really custom addition logic like this, I find that writing your own client-side updater works better. It keeps you from mucking around in Solr internals, which makes it easier to update in the future. You can definitely do this in SolrJ, but if you aren't a Java dev, there is probably a client-side library in your preferred language... PHP, Python, Ruby, C#, etc.
The rsolr Ruby gem (http://github.com/mwmitchell/rsolr/tree/master) makes it VERY easy to hack together a custom load script.
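For illustration, a bare-bones client-side version of the dateCreated check might look something like this in Python over the plain HTTP/JSON API (the core URL and field names are assumptions, and this omits the locking mentioned above):
import requests  # any HTTP client will do

SOLR = "http://localhost:8983/solr"  # adjust to your core/collection URL

def conditional_add(doc):
    # Fetch the currently stored document with the same id, if any.
    resp = requests.get(SOLR + "/select",
                        params={"q": 'id:"%s"' % doc["id"], "wt": "json"})
    existing = resp.json()["response"]["docs"]
    # Skip the update unless the incoming document is strictly newer.
    # (Comparing ISO-8601 date strings lexicographically works as long as
    # they share the same format and time zone.)
    if existing and existing[0].get("dateCreated", "") >= doc["dateCreated"]:
        return
    requests.post(SOLR + "/update?commit=true", json=[doc])

conditional_add({"id": "42", "dateCreated": "2013-05-01T12:00:00Z"})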

search in all fields for multiple values?

I have two fields:
title
body
and I want to search for the two words
dog
OR
cat
in each of them.
I have tried q=*:dog OR cat
but it doesn't work.
How should I type it?
PS: Could I somehow set the default search field to ALL fields in schema.xml?
As Mauricio noted, using a copyField (see http://wiki.apache.org/solr/SchemaXml#Copy_Fields) is one way to allow searching across multiple fields without specifying them in the query string. In that scenario, you define a destination field and then declare which source fields get copied into it.
<field name="mysearchfield" type="string" indexed="true" stored="false"/>
...
<copyField source="title" dest="mysearchfield"/>
<copyField source="body" dest="mysearchfield"/>
Once you've done that, you could do your search like:
q=mysearchfield:dog OR mysearchfield:cat
If your query analyzer is set up to split on spaces (typical), that can be shortened by grouping the terms:
q=mysearchfield:(dog cat)
If "mysearchfield" is going to be your standard search, you can simplify things even further by declaring it as the defaultSearchField in the schema:
<defaultSearchField>mysearchfield</defaultSearchField>
After that, the query would just become:
q=dog cat
