I have recently been trying to use Solr to search rich document files (e.g. .pdf, .doc, .xls, etc.).
When I try to import all the files from disk using the Solr admin UI (localhost:18983/solr/#/local.info/dataimport//dataimport), the message always shows "Index Completed", but no documents are added or updated.
I have also followed the official online manual for indexing a directory of rich files (lucene.apache.org/solr/quickstart.html#indexing-a-directory-of-rich-files).
That produced the following error messages:
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: localhost:8983/solr/local.info/update/extract?resource.name=%2Fvar%2Fsolr%2Fdata%2Flocal.info%2Frich_documents%2FNEWS.PDF&literal.id=%2Fvar%2Fsolr%2Fdata%2Flocal.info%2Frich_documents%2FNEWS.PDF
SimplePostTool: WARNING: Response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">71</int>
</lst>
<lst name="error">
<str name="msg">
Invalid UUID String: '/var/solr/data/local.info/rich_documents/NEWS.PDF'</str>
<int name="code">400</int></lst>
</response>
Here are my configs
data-config.xml, solrconfig.xml, schema.xml
Configs Link
Does anyone have an idea how to fix this problem?
Thanks
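The "Invalid UUID String" message in the response above suggests that the id field is declared with a UUID field type, while the post tool sends the file path as literal.id. One possible fix, sketched below, is to declare id as a plain string so that a path is accepted. The linked configs are not reproduced here, so the field definitions in this excerpt are assumptions rather than the actual schema:

```xml
<!-- schema.xml (hypothetical excerpt; adapt to the real schema) -->
<!-- Plain string type, so any value such as a file path is a valid id -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- Assumption: id was previously declared with a solr.UUIDField type -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```

Changing the type of the unique key normally means the existing index has to be rebuilt.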
In my database the dates look like 1973-01 and are stored as string values. How would I index them using Apache Solr?
I have written the below in my schema.xml:
<field name="pubdate" type="tdate" indexed="true" stored="true" multiValued="false" />
I have also changed all the dates to the form 1973-01Z, but I am still getting an error:
org.apache.solr.common.SolrException: Invalid Date in Date Math String:'1973-01Z'
I believe Solr only accepts dates like 1995-12-31T23:59:59Z.
Can anyone help?
In solrconfig.xml you can define the date formats your update request handler can process inside an updateRequestProcessorChain with the help of a ParseDateFieldUpdateProcessorFactory:
<updateRequestProcessorChain name="parse-field-types">
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<!-- A default time zone name or offset may optionally be specified for those
dates that don't include an explicit zone/offset.
-->
<str name="defaultTimeZone">Europe/Berlin</str>
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ssZ</str>
<str>yyyy-MM-dd HH:mm:ss Z</str>
<str>yyyy-MM-dd HH:mm:ss</str>
<str>yyyy-MM-dd HH:mm:ss 'UTC'</str>
</arr>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Then you have to connect the updateRequestProcessorChain to the update request handler:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">parse-field-types</str>
</lst>
</requestHandler>
You may be able to define a format here that works for your data.
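For the year-month strings from the question, a minimal sketch might add a yyyy-MM pattern to the format list. It assumes that the processor accepts a pattern without day or time fields and fills in the missing parts with defaults (first day of the month, midnight). That behaviour may differ between Solr versions, so it is worth checking on a test document first. The values would be sent as 1973-01, without the trailing Z:

```xml
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <!-- Dates without an explicit zone are interpreted in this zone -->
  <str name="defaultTimeZone">UTC</str>
  <arr name="format">
    <!-- Full timestamp pattern, kept from the chain above -->
    <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
    <!-- Assumption: year-month pattern for values such as 1973-01 -->
    <str>yyyy-MM</str>
  </arr>
</processor>
```

This element would replace the ParseDateFieldUpdateProcessorFactory entry inside the parse-field-types chain.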
The Solr wiki contains this passage:
To enable dynamic core configuration, make sure the adminPath attribute is set in solr.xml. If this attribute is absent, the CoreAdminHandler will not be available.
In the old-style solr.xml this attribute is set on the cores element:
cores adminPath="/admin/cores"
In the new (discovery) style solr.xml (available since Solr 4.4 and mandatory from the upcoming 5.0) there is no cores element and no mention of an adminPath attribute anywhere. As a result, opening localhost:8983/solr gives this error:
NetworkError: 404 Not Found - http://localhost:8983/solr/admin/cores?wt=json&indexInfo=false
Does this mean that dynamic core handling via HTTP is unavailable in Solr 4.4+, or did I miss a setting in the configs?
Thanks in advance.
Edit solr.xml so that it looks like this:
<solr>
<str name="adminHandler">${adminHandler:org.apache.solr.handler.admin.CoreAdminHandler}</str>
<int name="coreLoadThreads">${coreLoadThreads:3}</int>
<str name="coreRootDirectory">${coreRootDirectory:#SOLR.CORES.DIRECTORY#}</str>
<str name="managementPath">${managementPath:}</str>
<str name="sharedLib">${sharedLib:}</str>
<str name="shareSchema">${shareSchema:false}</str>
<solrcloud>
<int name="distribUpdateConnTimeout">${distribUpdTimeout:1000000}</int>
<int name="distribUpdateSoTimeout">${distribUpdateTimeout:1000000}</int>
<int name="leaderVoteWait">${leaderVoteWait:1000000}</int>
<str name="host">${host:}</str>
<str name="hostContext">${hostContext:solr}</str>
<int name="hostPort">${jetty.port:8983}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<logging>
<str name="class">${loggingClass:}</str>
<str name="enabled">${loggingEnabled:}</str>
<watcher>
<int name="size">${loggingSize:1000000}</int>
<int name="threshold">${loggingThreshold:100000}</int>
</watcher>
</logging>
</solr>
I'm trying to do some work on my server but am running into problems. When I try to ping the server through the admin panel I get this error, which I believe might be causing the problem:
The server encountered an internal error (Ping query caused exception:
undefined field text org.apache.solr.common.SolrException: Ping query
caused exception: undefined field text at
org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:76)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
Can anyone give me a bit of guidance as to what might be going wrong? I'm using Solr 3.6. I think it may have to do with how "text" is defined in the schema.xml.
This is my schema currently: https://gist.github.com/3689621
Any help would be much appreciated.
James
Based on the error, I am guessing that the query that is defined in the /admin/ping requestHandler is searching against a field named text, which you do not have defined in your schema.
Here is a typical ping requestHandler section:
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
</lst>
<lst name="defaults">
<str name="qt">standard</str>
<str name="echoParams">all</str>
<str name="df">text</str>
</lst>
</requestHandler>
Note the <str name="df">text</str> setting. This is the default field that the ping query is executed against. You should change it to a field that is defined in your schema, perhaps title or description.
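As a rough sketch, the same handler with the default field pointed at an existing field could look like this. The field name title is only a placeholder; use whichever field your schema.xml actually defines:

```xml
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <lst name="defaults">
    <str name="qt">standard</str>
    <str name="echoParams">all</str>
    <!-- Assumption: "title" is a field defined in your schema.xml -->
    <str name="df">title</str>
  </lst>
</requestHandler>
```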
Add this line to your schema.xml:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
How can I get different scores for the results of a multi-term search?
Certain results in Solr have the same score even when the query contains multiple terms, as the example below shows.
I have two indexes in Solr, each containing: id, first_name, last_name
Each index would look like the following:
<doc>
<str name="id">1</str>
<str name="last_name">fisher</str>
<str name="name">john</str>
</doc>
<doc>
<str name="id">2</str>
<str name="last_name">darby</str>
<str name="name">john</str>
</doc>
When I query just "john" both results come up. That is perfect.
However, when I query "john fisher" both results come up but with the same score.
What I want is different scores based on the relevancy of the search terms.
Here is the result for the following query
http://localhost:8983/solr/select?q=john+fisher%0D%0A&rows=10&fl=*%2Cscore
<response>
...
<result name="response" numFound="2" start="0" maxScore="0.85029894">
<doc>
<float name="score">0.85029894</float>
<str name="id">1</str>
<str name="last_name">fisher</str>
<str name="name">john</str>
</doc>
<doc>
<float name="score">0.85029894</float>
<str name="id">2</str>
<str name="last_name">darby</str>
<str name="name">john</str>
</doc>
</result>
</response>
Any help would be greatly appreciated.
Your best bet is to understand and analyse how the different factors affect a document's score. Lucene has a helpful Explanation feature, and Solr exposes it through the debugQuery parameter, which shows how each score is derived:
?q=john&fl=score,*&rows=2&debugQuery=on
Example response:
<lst name="debug">
<str name="rawquerystring">john</str>
<str name="querystring">john</str>
<str name="parsedquery">+DisjunctionMaxQuery((text:john))</str>
<str name="parsedquery_toString">+(text:john)</str>
<lst name="explain">
<!-- Score calculation for Result#1 -->
<str>
2.1536596 = (MATCH) fieldWeight(text:john in 36722), product of:
1.0 = tf(termFreq(text:john)=1)
8.614638 = idf(docFreq=7591, maxDocs=15393998)
0.25 = fieldNorm(field=text, doc=36722)
</str>
<!-- Score calculation for Result#2 -->
<str>
2.1536596 = (MATCH) fieldWeight(text:john in 36724), product of:
1.0 = tf(termFreq(text:john)=1)
8.614638 = idf(docFreq=7591, maxDocs=15393998)
0.25 = fieldNorm(field=text, doc=36724)
</str>
</lst>
Besides this, you can use explainOther to see how a particular document scores against the query, or why it did not match. It takes a query that identifies the documents to explain, for example:
?q=john&fl=score,*&rows=2&debugQuery=on&explainOther=id:2
Do Read:
Solr Relevancy
Lucene Scoring
It looks to me as if you are only searching on the "name" field. That's why the scores are the same. If you use DisMax you can easily search on both fields, and the most relevant document will get a higher score.
e.g.
<str name="defType">edismax</str>
<str name="qf">name last_name</str>
Another way is to combine the 2 fields into 1 field with copyField and only search in the newly created field.
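For the first option, the two parameters would typically sit in the defaults of a search handler in solrconfig.xml. The sketch below assumes the handler is registered as /select and that name and last_name are the only fields to search. The handler name is an assumption, since the question does not show its solrconfig.xml:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended DisMax parser for the main query -->
    <str name="defType">edismax</str>
    <!-- Search both fields; a document matching "john" and "fisher" should outscore one matching only "john" -->
    <str name="qf">name last_name</str>
  </lst>
</requestHandler>
```

With this in place, the query john fisher is checked against both fields, so document 1 matches on two terms and document 2 on one.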
Thanks for the quick replies, guys, I appreciate it.
From the explain output I was able to confirm that the search was indeed being performed on only one field.
I saw that it is possible to copy multiple fields into a single field for searching.
In the schema.xml I added the following:
<copyField source="last_name" dest="text"/>
The results now come up as expected when using more than one search term.
I am searching over 6 Solr shards (Solr version 3.5). What I noticed is that when I run the search on my normal standalone instance, which contains the same data, I get 2 facet_fields in the facet_counts section. This is what I expect:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="url">...</lst>
<lst name="url">...</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
As you can see, there are 2 facet_fields. When I run the same query across multiple shards (same data), I always get just one facet_field:
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="url">...</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
I am also using tagging and excluding filters in my Query. Could this be the problem?
Thanks to Yonik Seeley from the solr-user mailing list, the solution was to add output keys to the facets.
See also http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
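As an illustration of output keys, here is a sketch using the local-params syntax described on that wiki page, written as handler defaults. The handler name, the tag name urlTag, the key names url_all and url_filtered, and the filter value are all made up for the example; the original query is not shown in this thread:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <!-- Tagged filter; the tag name and the filter value are placeholders -->
    <str name="fq">{!tag=urlTag}url:example</str>
    <!-- Facet on url ignoring the tagged filter, reported under the key url_all -->
    <str name="facet.field">{!ex=urlTag key=url_all}url</str>
    <!-- Facet on url with the filter applied, reported under the key url_filtered -->
    <str name="facet.field">{!key=url_filtered}url</str>
  </lst>
</requestHandler>
```

Because each facet now has its own name in the response, the two url facets should no longer collapse into one entry when the shard responses are merged.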