ColdFusion 10: moving to Solr from Verity - search

We are moving from ColdFusion 9 with Verity to ColdFusion 10 with Solr in an application that searches PDF files with metadata attached to them. We get very different results in testing. We eventually found that Solr is searching the file contents very differently than Verity did. Is there something we can "tweak" to make Solr searching more effective, or to get the two to search the same way? We sometimes get very different results searching on a single word, not just multiple words.
Edit: I found that the PDFs I was using were mostly really old, and after doing a batch save in Acrobat to resave them as a newer version, I got much better results. Also, the default operator in Verity was AND while in Solr it is OR, so I changed that in the config file under my collection's folder in the Jetty install. This all helped, but other than that I'm still getting little differences here and there. I hope this helps anybody else who sees this post. I'm not sure what else I can do to really "tune" it.
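For anyone else making the same change: the default operator is normally set in the collection's schema.xml (under the conf folder of the collection inside the Jetty/Solr install that ships with ColdFusion 10). Roughly, it is this one line, assuming the Solr 3.x/4.x-style schema that ColdFusion generates:

<solrQueryParser defaultOperator="AND"/>

With AND, a multi-word search only matches documents containing all of the words, which is closer to Verity's behaviour.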

Related

Shopware Product Export Feed for Doofinder plugin

I have a catalog set up under Shopware, and I have installed the Doofinder plugin for search. Now I need to provide the feed URLs to Doofinder, and the feeds are set up as well. But one of the feeds does not generate the XML correctly; it is trying to export around 18k records, while there is another feed that exports around 74k records.
Can someone please give me some pointers on the probable cause and solution? I am a newbie to Shopware.
First off, which version of Shopware are you using? (5.6.7 or 6.2.2?)
For now I assume you are using 5.6.7 (or anything along the 5.6 line).
How do you generate the feeds?
Have you checked the logs in var/log/? Maybe there is a fairly obvious error there.

How to search newsgroups in gnus

I have gnus working for multiple email addresses with searching via
(nnir-search-engine imap)
I have newsgroup reading set up and working fine too; however, I have never been able to get searching in newsgroups working, even though I have
(setq gnus-select-method '(nntp "news.gmane.org"
                                (nnir-search-engine gmane)))
With the latter, with my cursor on a gmane newsgroup, I expect to be able to press G G, enter a search, and have it return a list of hits as it does with imap search. Instead, I get the message
Contacting host: search.gmane.org:80
open-network-stream: search.gmane.org/80 nodename nor servname provided, or not known
in my mini and Messages buffers.
Any idea what is going on and how to rectify this?
One thought I had was that perhaps I needed to use gnus-agent and an agent category to let me download messages via J s (all of which I did set up, but I haven't fully understood where it is saving things, etc.).
Everything else works great in gnus, I just want to search newsgroups too in gnus.
p.s. I have downloaded Unison, which is quite nice and free now, and it can do what I need, but I still hope to do it in gnus.
The gmane search engine does not work because gmane has undergone some changes: gmane search has gone bust for the last two years (?) or so, ever since Lars decided that he was not going to continue with gmane. Although the people who took over brought the nntp service back, search is still missing.
There are other search engines, however: the gnus manual lists swish++, swish-e, namazu, notmuch and hyrex (obsolete). I have no idea how well each works; I do know that they require configuration (imap search and gmane search, before it broke, worked right out of the box).
The doc has very few details on the rest, but it does describe how to set up namazu: it requires that you create and maintain index files, presumably indexing a set of local files. The doc's emphasis is on indexing local email, but presumably it would work similarly for downloaded local news articles.

Where to replace example schema?

I am using TYPO3 7.x and want to use Solr search. I created a core using the Solr Admin UI but am getting an "unsupported schema" error. Please see the screenshots; I am totally new to Solr and TYPO3, please help. Screenshot of error,
Screenshot of core in Solr Admin
The Solr layout is usually a home directory that contains a solr.xml file and, under that, one or more collection/core directories, each containing a core.properties file.
In your Solr screenshot, instanceDir is that collection/core location and it is /var/solr/data/core_en. The schema.xml and solrconfig.xml then live under the conf directory within that.
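Assuming core_en is a standard core (core_en is just your core's name; the rest are Solr defaults), the layout looks roughly like this:

/var/solr/data/
    solr.xml
    core_en/
        core.properties
        conf/
            solrconfig.xml
            schema.xml   (renamed to managed-schema on first start, see below)
        data/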
Notice that schema.xml would most likely be renamed to managed-schema on the first run (at least with the default solrconfig.xml). When you are copying your configs over, make sure to remove the old managed-schema as well.
Now, one complication is that your TYPO3 seems to be looking for it under /solr/core_en, which is not the same path as /var/solr/data/core_en. I would double-check what you have in /solr/core_en as well; perhaps you have two conflicting installations or similar.
After you replace the configs, restart Solr or - at least - reload the core in Admin UI's core screen.
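Reloading can also be done through the CoreAdmin API; roughly, assuming Solr is listening on localhost:8983 and the core is named core_en:

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core_en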
Thanks for your reply. There was no problem with the paths. When I was making the core, I moved the conf folder from /typo3conf/ext/solr/Resources/Private/Solr/configsets/ext_solr_6_1_0 to /var/solr/data/core_en but forgot to move the typo3lib folder. It works now, but search is still not working. Whenever I search something it says "Nothing found for "here is the term I search"." In the next step, when I index in the search module, it gives me "Manual Indexing
An error occurred while indexing manual from the backend" (screenshot 1), and when I click show error it gives this.
I'm taking help from docs.typo3.org/typo3cms/extensions/solr/Backend/ConnectionManager.html
I appreciate your help.

Search Algorithm for a web application that needs to look for a specific value

I'm developing a webapp that will need to download the HTML from a website and then iterate through the code to find a specific but ever-changing value (in our case it will be the price of the product).
For this, I was thinking of asking the user (upon installation and setup) to provide the system with a few lines of HTML from the page (the lines that contain the price); from then on, every time we need to fetch the price we would search for those lines and extract the price from them.
Now, I believe this is a horrible and slow way of doing it, but since there are no rules and the HTML can be totally different from one website to another (and even the same website might change), I couldn't find a better way.
One improvement I thought about was to iterate through the page the first time and record the line at which we find the value. On subsequent fetches we would then start the search a few lines before the expected location. Any thoughts on how I can improve on this?
I posted this question on https://cstheory.stackexchange.com/ but they commented that it's not on topic and that I should post it here.
I have the code for the above and if needed I can post it, I'm simply thinking that there must be a better, faster way of doing this.
This is actually something I tried for a project recently (using BeautifulSoup and Python). The solution that worked for me was to work out CSS selectors (which can map to jQuery selectors) that targeted the elements containing the values I was looking for. In my case I was able to narrow the full document down to just the elements that contained what I was after, but if you can't get exactly what you want, you could combine this with some extra logic, such as a test to see whether the text looks like a price (via regex) or a check of what it sits next to.
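A minimal sketch of that approach, assuming the requests and beautifulsoup4 packages; the URL and the span.price selector are just placeholders for whatever selector you work out for the target site:

import re
import requests
from bs4 import BeautifulSoup

# Rough "looks like a price" test: digits, optionally with a decimal part
PRICE_RE = re.compile(r"\d+(?:[.,]\d{2})?")

def fetch_price(url, css_selector):
    # Download and parse the page once
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the document down to the candidate elements
    for element in soup.select(css_selector):
        text = element.get_text(strip=True)
        match = PRICE_RE.search(text)
        if match:
            return match.group()
    return None

# Placeholder usage
print(fetch_price("https://example.com/product/123", "span.price"))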

Solr behind Drupal returns too many results for specific query

We've got Solr sat behind one of our client's Drupal 7 websites, and while it's working well, it returns too many results for what should be quite specific queries. (It also has relevance/weighting problems; but I'm hoping that solving this problem will remove the - literally - irrelevant results.)
For example, searching for the phrase 'particular phrase in london' should return the node with that as its title, quite high up; I don't even think that any other content should be returned. But I find that it's returning lots of content, purely because it mentions "London"!
Frivolously, searching for the ridiculous phrase 'piecrusts in london' returns a lot of results too, apparently just because they mention London. No content on the site mentions actual piecrusts.
When I search for 'particular phrase in london', here are the parameters that end up in the catalina.out log on the server (whitespace added for clarity):
{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
&spellcheck.q=particular+phrase+in+london
&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
&facet.date=ds_created
&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
&f.bundle.facet.mincount=1&hl.fl=content,ts_comments
&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
tos_name,tm_node,zs_entity
&start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
&facet.field=im_field_health_topic&facet.field=bundle
&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
Note that these parameters have been built by Drupal's Apache Solr module; I don't believe we've got any particular custom code of our own that's doing anything to it.
This corresponds to the following URL, if entered directly in the browser:
http://example.com:8081/solr/CLIENT/select?spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1&facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR&spellcheck.q=particular+phrase+in+London&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200&facet.date=ds_created&f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR&f.bundle.facet.mincount=1&hl.fl=content,ts_comments&json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,tos_name,tm_node,zs_entity&start=0&facet.sort=count&f.bundle.facet.limit=50&q=particular+phrase+in+London&f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR&bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0&facet.field=im_field_health_topic&facet.field=bundle&f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50
This URL returns nearly 2000 results - that's most of the content on the site! I've experimented with removing each query parameter in turn, and the only ones that make any difference seem to be qf and q: if I remove qf, I get zero results; if I remove q, I get more results back!
I guess there are two questions here:
Is there anything in these parameters that tells Solr "don't worry if 'particular phrase', or 'piecrusts', appears: just collate the results for 'london'" and then order by relevancy? I would add that I think 'in' is listed in the stopwords file, so we can probably ignore the effect of that (?)
Or is this something in the (standard Drupal) schema that I need to change?
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content.
As mentioned in the post at http://drupal.org/node/1783454, the Apache Solr Search Integration module makes use of the mm (minimum should match) param, which broadly controls how many of the search keywords have to match before a document is returned, and therefore how closely results track the keywords. Looking through the docs, there are other ways you can use the parameter to affect the results as well. The results produced by Apache Solr Search Integration are therefore weighted more closely to the AND operator, even though it will return more results as you add more keywords. The benefit of this param is that in cases where the user enters keywords that are too restrictive, results will still be returned. Displaying no results is a really quick way to drive people away from your site.
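If your client really does want every word to be required, mm can be forced to 100%. A sketch of the two usual places to set it; the /select handler below is just the stock handler name, the Drupal module may point at its own handler in solrconfig.xml:

&mm=100%25        (appended to the query URL; %25 is the URL-encoded %)

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="mm">100%</str>
  </lst>
</requestHandler>

Note that mm only applies when the dismax or edismax query parser is in use, which the qf parameters in your log suggest is already the case.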
How are you displaying the search?
Maybe you could use Solr Views to limit the search range?
http://drupal.org/project/apachesolr_views
thanks
Nick
