Solr - index JSON query string from database?

I would like to know whether it is possible to index data that contains a JSON string, decoding it so that each JSON value is indexed as a separate field.
I am using the DIH to connect to a MySQL database and am able to index the individual columns.
The result would look like the following:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="otherinfo">{"day":15,"year":1989,"month":"January"}</str>
</doc>
</result>
</response>
The problem is that "otherinfo" is a JSON string that I would like to decode and have something like the following in my index:
<response name="response" numFound="1" start="0" maxScore="2.7143538">
...
<result name="response" numFound="1" start="0" maxScore="2.7143538">
<doc>
<float name="score">2.7143538</float>
<str name="id">82</str>
<str name="name">jorge</str>
<str name="day">15</str>
<str name="year">1989</str>
<str name="month">January</str>
</doc>
</result>
</response>
Would this be possible to do at all with Solr?
Thanks in advance

I commented on this. I decided that I should answer instead.
The fix for your issue isn't at the Solr level. You shouldn't be storing your data this way in the DB to begin with. In the long run, it would be better to fix this problem there, as opposed to trying to hack this at the Solr indexing level.
Your question shows that someone, probably an end user, is interested in searching by this data. That implies it should probably be stored in the database as an actual Date or Timestamp column so that it can be properly selected and sorted on.
I'm sure people won't like that this doesn't exactly answer your question, but someone needs to tell you this.

If you know your way around Java you could write your own custom transformer to handle your specific case.
Have you tried using the DIH RegexTransformer to parse the JSON?
I think that should be doable, especially if the JSON format is fixed (no arbitrarily nested documents).
I've just noticed ScriptTransformer, which allows you to write your own parser. I think this is the way to go...
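To illustrate the ScriptTransformer route, here is a minimal data-config.xml sketch. It assumes an item table with the otherinfo JSON column from the question and a JavaScript engine that provides JSON.parse (with older JDK Rhino builds you may need eval instead); the driver, URL and credentials are placeholders:
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <script><![CDATA[
    function parseOtherInfo(row) {
      // coerce the Java string to a JS string, then decode the JSON column
      var s = '' + row.get('otherinfo');
      var json = JSON.parse(s);
      // promote each JSON value to its own field on the row
      row.put('day', json.day);
      row.put('year', json.year);
      row.put('month', json.month);
      // drop the raw JSON so it is not indexed as-is
      row.remove('otherinfo');
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" transformer="script:parseOtherInfo"
            query="select id, name, otherinfo from item"/>
  </document>
</dataConfig>
The day, year and month values then map onto fields of the same names in your schema.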

Is the otherinfo field in the DB a JSON string to start with?
You would need dynamic fields (docs, explanation) and client-side code to let Solr store data with an arbitrary schema.
You would need to define dynamic fields in your schema like:
dyn_string_*: store text as is
dyn_text_*: store text and index it for search
etc.
Then you will need to tell DIH to map DB fields to solr dynamic fields (pseudocode warning; sorry, but I am not familiar with DIH):
SELECT
  day  AS dyn_number_day,
  name AS dyn_text_name
FROM tablename
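For reference, the matching dynamic field declarations in schema.xml could look like the sketch below; the dyn_* prefixes and the field types are assumptions following the naming above:
<!-- stored as-is, not searchable -->
<dynamicField name="dyn_string_*" type="string" indexed="false" stored="true"/>
<!-- stored and analyzed for full-text search -->
<dynamicField name="dyn_text_*" type="text_general" indexed="true" stored="true"/>
<!-- numeric values such as day or year -->
<dynamicField name="dyn_number_*" type="int" indexed="true" stored="true"/>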
Edit
You do have the requirement to query into the data structure. This needs a schema-less datastore.
Document DBs like MongoDB offer exactly this functionality: they store data with arbitrary fields you determine at insert time, and they can run any kind of ad-hoc query on your data.
I am not aware of a request handler that can index your data for that. You can write code that periodically fetches updated (or added or removed) rows, decodes the JSON field and indexes it to Solr.
I recommend a skinny data model to store attributes independently of the current DB schema. I asked a question, 'Set intersection in MySQL: a clean way', about this a while back.
Recap: MongoDB and friends offer exactly the functionality you need. If you want relations and referential integrity, you can keep using an RDBMS. If you still want that JSON approach, develop an active system that will parse it and index it to Solr. But I recommend moving to a skinny data model, since you can get the same (conditions apply!) query capabilities that Solr gives you with SQL.
Exotic technology: Graph databases like Neo4j contain document database functionality (ad-hoc queries) and relations: a relation directly links one node to another, no joins involved. So it's just one step short of referential integrity.

Related

Solr conditional query fields (qf)

Is it possible to define query fields in Solr based on certain conditions? For example, I have three fields: text, title and product. The Solr config definition:
<str name="qf">text^0.5 title^10.0 Product</str>
What I'm looking for here is to include "product" as a searchable field only when a certain condition is met, e.g. if author:"Tom", then search in Product as well.
Is there a way to do that at query time using edismax?
The alternative I have is to add the product information to either the text or title field of the document (where author=Tom) at index time so it'll be searchable. But I'm trying to avoid this if possible.
Any pointers will be appreciated.
-Thanks
In order to search in different fields based on different conditions, you first need to search for those specific conditions, so it is more or less the same as issuing multiple queries.
That said, if there is a need to do it as a single query (e.g. to use out-of-the-box sorting, grouping, and other Solr features), nested queries can be used.
For defining two different conditions (as in the original question, but it can easily be extended with more OR clauses), the q parameter can receive the following value:
_query_:"{!edismax fq=$fq1 qf=$qf1 v=$condQuery}"
OR
_query_:"{!edismax fq=$fq2 qf=$qf2 v=$condQuery}"
The query uses Parameter Dereferencing, so there is no need to manually escape any special characters before passing the parameters to solr.
fq1 - first special condition
qf1 - list of fields to search in for the first special condition (fq1)
fq2 - second special condition
qf2 - list of fields to search in for the second special condition (fq2)
condQuery - the actual search term/query
The fq1 may be empty in order to define a baseline (in this particular case - search in text and title, but not in product).
The raw parameters themselves will look the following way:
fq1=&qf1=text^0.5 title^10.0&fq2=author:"Tom"&qf2=text^0.5 title^10.0 Product&condQuery=5
And the final query will be something like this:
http://localhost:8983/solr/collection1/select?q=_query_%3A%22%7B!edismax+fq%3D%24fq1+qf%3D%24qf1+v%3D%24condQuery%7D%22+OR+_query_%3A%22%7B!edismax+fq%3D%24fq2+qf%3D%24qf2+v%3D%24condQuery%7D%22&fl=*%2Cscore&wt=xml&indent=true&fq1=&qf1=text^0.5%20title^10.0&fq2=author:%22Tom%22&qf2=text^0.5%20title^10.0%20Product&condQuery=5
.. or the same query returned by solr in solr response (provided only for showing it in a structured way):
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">_query_:"{!edismax fq=$fq1 qf=$qf1 v=$condQuery}" OR _query_:"{!edismax fq=$fq2 qf=$qf2 v=$condQuery}"</str>
<str name="condQuery">5</str>
<str name="indent">true</str>
<str name="fl">*,score</str>
<str name="fq1"/>
<str name="qf1">text^0.5 title^10.0</str>
<str name="fq2">author:"Tom"</str>
<str name="qf2">text^0.5 title^10.0 Product</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="..." start="..." maxScore="...">
...
</result>
</response>
Even though it works, I suggest considering the effect it has on query time (each condition results in a separate internal search query) and measuring how it affects your specific case.
I didn't try it myself, but it looks like this could be achievable by using http://wiki.apache.org/solr/FunctionQuery#Boolean_Functions
Shamik,
I don't think there is an easy way to do this in Solr. One thing to consider is managing these rules over time, too; it would be a nightmare for a large system.
If you really wanted to do something like this, maybe you can issue two calls to Solr to get the result set.

Can data in Solr be extended with manually defined meta data?

I have several documents in a Solr collection that I want to be able to search through. Most of the data comes from web sites I can easily crawl; however, I need to add some attributes manually because they are not present on the crawled sites.
So as an example I get the following info from a site (all attributes returned from crawled site):
Name: Porsche Boxter
Year: 1996
...
I want to add additional fields through a web interface (info not present on crawled sites):
Cool: yes
foo: bar
My questions:
Does it make sense at all to store additional information along with the indexed data within Solr (inside the documents), or would a best practice be to keep only the crawled data in Solr and merge it with an externally managed database at query time? To me it makes more sense to have all the data that is eventually queried in Solr, as some of the manually added attributes are required search criteria (e.g. look only for cool cars from the 90s).
Is it possible to use Solr to store additional information about indexed documents? I know the entire schema in advance, perhaps this is useful?
If I store my data exclusively in Solr, how can I ensure that during the next crawl the manually added data is not overwritten? Would partial update be required?
Since I am new to Solr, it would also be very helpful if someone could simply point out what to look for in the documentation that describes my use case.
That depends on how often the external data changes. The more often it changes, the less meaningful this is. Generally it is a good idea to store such data along with the indexed data, because you get it without an additional database query.
Yes. Use indexed:false and stored:true. If you did not know all of these fields in advance, you could use a dynamicField like <dynamicField name="*_stored" type="string" indexed="false" stored="true" />.
Yes. You have to use partial update. This is no problem in your case, because the fields not updated have stored:true.
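For illustration, a partial (atomic) update that only touches a manually added field could look like the sketch below (assuming Solr 4+ atomic updates); the document id and the cool_stored field name are hypothetical, and atomic updates require the updateLog to be enabled and the other fields to be stored:
<add>
  <doc>
    <field name="id">porsche-boxter-1996</field>
    <!-- "set" replaces just this field; the other stored fields are preserved -->
    <field name="cool_stored" update="set">yes</field>
  </doc>
</add>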

Solr: How to specify field relevancy/weight

I'm currently indexing data using Solr that consists of about 10 fields. When I perform a search I would like certain fields to be weighted higher. Could anyone help point me in the right direction?
For example, searching across all fields for a term such as "superman" should return hits in the "Title" field before the "Description" field.
I've found documentation on how to make one field score higher from the query, but I would prefer to set this in a configuration file or similar. The following would require all searches to specify the weight. Is it possible to specify this in the solr config file?
q=title:superman^2 description:superman
Try using qf with ExtendedDisMax; your query would then look like this:
q=superman
While your config will look like:
<str name="qf">title^2 description</str>
You can get some working examples here
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
Assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and gives fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree.
Source: Apache Lucene
In your case: qf="title^100 description" may do the trick - if you're using Solr in a library I'd love to chat.
By using edismax we can achieve what you are looking for.
Try adding these two parameters to your request handler, changing the fields to your own.
You can remove a particular field completely if you don't want it.
<str name="defType">edismax</str>
<str name="qf">YourField^50 YourAnotherField^30 YetAnotherField</str>
The higher the boost (^) value, the more priority that field gets.
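If it helps, this is roughly how those parameters could sit inside a request handler definition in solrconfig.xml; the handler name and field names are placeholders:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- higher boosts give these fields more weight in scoring -->
    <str name="qf">YourField^50 YourAnotherField^30 YetAnotherField</str>
  </lst>
</requestHandler>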

solr - modeling multiple values on 1:n connection

I'm trying to model my DB using this example from the Solr wiki.
I have a table called item and a table called features with id, featureName, description.
Here is the updated XML (with featureName added):
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document>
<entity name="item" query="select * from item">
<entity name="feature" query="select description, featureName as features from feature where item_id='${item.ID}'"/>
</entity>
</document>
</dataConfig>
Now I get two lists in the XML doc element:
<doc>
<arr name="featureName">
<str>number of miles in every direction the universal cataclysm was gathering</str>
<str>All around the Restaurant people and things relaxed and chatted. The</str>
<str>- Do we have... - he put up a hand to hold back the cheers, - Do we</str>
</arr>
<arr name="description">
<str>to a stupefying climax. Glancing at his watch, Max returned to the stage</str>
<str>air was filled with talk of this and that, and with the mingled scents of</str>
<str>have a party here from the Zansellquasure Flamarion Bridge Club from</str>
</arr>
</doc>
But I would like to see the lists together (using XML attributes) so that I don't have to join the values.
Is it possible?
I wanted to suggest the ScriptTransformer, as it gives you the flexibility to alter the data as needed, but it will not work in your case since it operates at the row level.
You can always define an aggregation function for string concatenation in SQL (sketched below), but you may run into performance issues.
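For example, with MySQL the child-entity query could collapse the feature rows into a single combined value; the exact functions differ per database (HSQLDB 2.x also offers GROUP_CONCAT), and the combinedFeatures alias is just an illustration:
select group_concat(concat(featureName, ': ', description) separator ' | ') as combinedFeatures
from feature
where item_id='${item.ID}'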
If you were using an HTTP/XML data source, the solution would have been the flatten attribute.
Nevertheless, the search functionality will work as expected even if you end up with multi-valued fields. The downside is on the client side, where you will have to concatenate them before the presentation layer, which is not really a problem if you use some sort of pagination.

Solr conditional adds/updates?

I have a fairly simple need to do a conditional update in Solr, which is easily accomplished in MySQL.
For example,
I have 100 documents with a unique field called <id>
I am POSTing 10 documents, some of which may be duplicate <id>s, in which case Solr would update the existing records with the same <id>s
I have a field called <dateCreated> and I would like to only update a <doc> if the new <dateCreated> is greater than the old <dateCreated> (this applies to duplicate <id>s only, of course)
How would I be able to accomplish such a thing?
The context is trying to combat race conditions resulting in multiple adds for the same ID but executing in the wrong order.
Thanks.
I can think of two ways:
Write your own UpdateHandler and override addDoc to implement that checking.
Put the appropriate locks (critical sections) in your client code in order to fetch the stored document, compare the dates, and conditionally add the new document in a thread-safe manner.
Remember that Solr is not a database; comparing it to MySQL is comparing apples and oranges.
As of solr 4.0, optimistic concurrency is enabled via the _version_ field.
http://yonik.com/solr/optimistic-concurrency/
To enable, you need to make sure your schema.xml contains
<field name="_version_" type="long" indexed="true" stored="true"/>
and in solrconfig.xml
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
</updateHandler>
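With that in place, a conditional update is done by sending the document together with the _version_ value you last read; if the document has changed in the meantime, Solr rejects the update with an HTTP 409 Conflict, and your client can re-read it, compare dateCreated itself, and retry. A rough sketch (the id and version values are placeholders):
<add>
  <doc>
    <field name="id">82</field>
    <field name="dateCreated">2013-05-01T00:00:00Z</field>
    <!-- must match the version currently stored, otherwise Solr returns 409 Conflict -->
    <field name="_version_">1436442117906913280</field>
  </doc>
</add>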
With really custom addition logic like this, I find that writing your own client-side updater works better. It keeps you from mucking around in Solr internals, which makes it easier to update in the future. You can definitely do this in SolrJ, but if you aren't a Java dev, there is probably a client-side library in your preferred language... PHP, Python, Ruby, C#, etc.
The rsolr Ruby gem (http://github.com/mwmitchell/rsolr/tree/master) makes it VERY easy to hack together a custom load script.
