I've just set up Solr, indexed some pages (crawled using Nutch), and I can now search.
I now need to change it to index sentences instead of web pages. The result I need is, for example, to do a search for "one word" and get a list of all sentences that contain "one" and/or "word".
I'm new to Solr so any pointers to where I should start from to achieve this would be extremely helpful. Is it at all possible? Or is there an easy way of doing this I've missed?
Yes. Solr indexes 'documents'. You define what a document is by what you post to it via the RESTful endpoint. If you push one sentence at a time, it indexes one sentence at a time.
If you meant, 'can I push a whole document and have Solr split it into sentences and index each one individually?', then the answer is, I think, not very easily inside Solr. Since you are using Nutch, I'd recommend doing the splitting in Nutch so that it presents Solr with one sentence at a time.
Neither the analysis chain nor update request processors provide for splitting a document into smaller documents. You might also consider the Elasticsearch alternative, though I have no concrete knowledge that it offers an easy path to your solution either.
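For illustration, here is a minimal sketch of the 'split outside Solr, push one sentence at a time' approach, using the JDK's BreakIterator for the splitting and SolrJ for the pushing. The core name (sentences), the field names (id, url_s, sentence_t), and the sample page are assumptions; adapt them to your own schema:

```java
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SentenceIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/sentences").build()) {

            String pageUrl = "http://example.com/page1"; // hypothetical source page
            String pageText = "Summertime, and the living is easy. "
                            + "Fish are jumping. And the cotton is high.";

            // Split the page text into sentences with the JDK's BreakIterator.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(pageText);

            int n = 0;
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE;
                    start = end, end = it.next()) {
                String sentence = pageText.substring(start, end).trim();
                if (sentence.isEmpty()) {
                    continue;
                }
                // One Solr document per sentence.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", pageUrl + "#" + n++); // unique per sentence
                doc.addField("url_s", pageUrl);          // link back to the page
                doc.addField("sentence_t", sentence);    // the searchable text
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```

Deriving each id from the page URL keeps the documents unique and lets you trace a matching sentence back to its source page.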
I'm trying to understand the purpose of configuring different analyzers for searching and indexing in Azure Search. See: https://learn.microsoft.com/en-us/rest/api/searchservice/create-index#-field-definitions-
According to my understanding, the job of the indexing analyzer is to break up the input document into individual tokens. Through this process, it might apply multiple transformations like lower-casing the content, removing punctuation and whitespace, and even removing entire words.
If the tokens are already processed, what is the use of the search analyzer?
Initially, I thought it would apply a similar process to the search query itself, but wouldn't setting a different analyzer than the one used to index the document completely break the search results? If the indexing analyzer lower-cased everything but the search analyzer doesn't lower-case the query, wouldn't that mean you'll never get matches for queries with upper-case characters? And if the search analyzer doesn't split tokens on whitespace, won't matching fail the moment the query includes a space?
Assuming that this is indeed how the two analyzers work together, why would you ever want to set two different ones?
Your understanding of the difference between the index and search analyzers is correct. An example scenario where that's valuable is using n-grams for indexing but not for search terms. This would allow a document with "cat" to produce "c", "ca", and "cat", but you wouldn't necessarily want to apply n-grams to the search term, as that would make the query less performant, and it isn't necessary since the index already contains the n-grams. Hopefully that makes sense!
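Since Azure Search analyzers are Lucene analyzers under the hood, here is a rough Lucene sketch of that exact scenario (the field name and gram sizes are arbitrary; assumes a recent Lucene, 8.x or later):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class NgramAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // Index-time analyzer: lowercase, then expand every token into edge
        // n-grams, so "cat" is indexed as "c", "ca", "cat".
        Analyzer indexAnalyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("edgeNGram", "minGramSize", "1", "maxGramSize", "10")
                .build();

        // Search-time analyzer: the same chain minus the n-grams. The query
        // stays one token; it still matches because the index already holds
        // every prefix gram.
        Analyzer searchAnalyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .build();

        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(indexAnalyzer))) {
            Document doc = new Document();
            doc.add(new TextField("name", "cat", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // "ca" matches because the prefix gram was produced at index time.
            int hits = searcher.count(new QueryParser("name", searchAnalyzer).parse("ca"));
            System.out.println("hits for 'ca': " + hits); // 1
        }
    }
}
```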
The case I am facing seems very simple, but I truly can't see a clear solution:
Imagine I want to index a text containing "Summertime, and the living is easy" in a Lucene index.
I want a search in my UI for "summer time" to find the indexed document containing "Summertime", while keeping all the benefits of a standard analysis setup.
I imagined that using a FuzzyQuery would suffice (since the edit distance is 1), but since the tokenizer I use splits on spaces, that solution isn't relevant.
I don't know which analyzer to use to make this possible while keeping all the benefits of a StandardAnalyzer-like setup (stop words, the possibility to add synonyms, ...).
Maybe it's simpler than I think (at least it seems so), but I really can't see any solution for now... :(
You can use a ShingleFilter to make Solr combine multiple tokens into one, with a user-defined separator.
That way you'll get "summer time" as a single token, as well as "summer" and "time" (unless you disable outputUnigrams). The shingled tokens are then only a small edit distance away from "Summertime", so the fuzzy search should work as you want it to.
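As a rough sketch, assuming a recent Lucene (8.x or later), the analysis chain could look like this; stop word and synonym filters could be inserted into the same chain to keep the StandardAnalyzer-style benefits:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShingleAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Combine each pair of adjacent tokens into one shingle, glued
        // together with an empty separator:
        // "summer time" -> "summer", "summertime", "time"
        ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
        shingles.setTokenSeparator("");
        // shingles.setOutputUnigrams(false); // uncomment to drop "summer"/"time"
        return new TokenStreamComponents(source, shingles);
    }
}
```

Note that if this analyzer is also applied at query time, "summer time" produces the shingle "summertime", which matches the indexed token exactly, so the fuzzy query may not even be needed.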
Let's say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'},
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-style tools like Mongo, MapReduce, or Hadoop? All I know are the names of these other technologies; I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list), you can use the eDisMax mm (minimum should match) parameter; cf. the Solr Wiki.
Having said that, I must add that I've never searched for 200 query terms (do I understand correctly that the list whose contents should be searched for will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
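For what it's worth, a minimal SolrJ sketch of such a query could look like this; the core name (lists) and field name (ingredients) are assumptions:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ListMatcher {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/lists").build()) {

            // Each item of the search list becomes one optional term.
            SolrQuery q = new SolrQuery("potato rice carrot corn");
            q.set("defType", "edismax");
            q.set("qf", "ingredients"); // the multiValued field to match on
            q.set("mm", "40%");         // at least 40% of the terms must match
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(doc ->
                    System.out.println(doc.getFieldValue("ingredients")));
        }
    }
}
```

With mm relaxed like this, Solr still ranks lists sharing more terms with the query higher, which should give the 'closest match first' ordering described above.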
I am using the newest SIREn distribution for Solr to index my data and search it. (http://siren.solutions/siren/downloads/)
Is there a simple way to search for similar documents in my indexed data, something like the MoreLikeThis query in Solr (https://cwiki.apache.org/confluence/display/solr/MoreLikeThis)?
My goal is to find documents that have a JSON structure similar to the one I am interested in.
Best,
Bernd
If I remember correctly, SIREn stores the RDF representation of each resource within a dedicated field of the Solr document. I don't think the default MLT component that comes with Solr works for your scenario.
I mean, enabling that component will produce some kind of result, but I don't believe it will follow your JSON "similarity" requirement.
On top of that, I suggest you post your request on the SIREn mailing list [1]; I'm sure the dev team will point you down the right path.
[1] https://groups.google.com/forum/m/#!forum/siren-user
I have a Sphinx server to index a MySQL database for a Django app. My search is working fine, but my content includes medical words/phrases. So, for example, I need a search for "dvt" to also match against "deep venous thrombosis" and even "deep vein thrombosis". I looked through the documentation and see options for "wordforms" and "morphology". Which of these (or something else) should I use? Also, will it work backwards? I.e., will a search for "deep venous thrombosis"/"deep vein thrombosis" match against "dvt"?
Also, I would appreciate some advice on how to set these up since I'm new to sphinx in general.
You will need to provide your own list of word/term synonyms to be used in query expansion.
Since Sphinx does not currently support synonym expansion in queries, you'll need to massage the query based on your list of synonyms before submitting it to the search engine.
So, using your example (a code sketch follows these steps):
User queries for: 'dvt remediation procedures'.
Server receives query and checks each term against its list of synonyms.
Server finds a match and adds 'deep vein thrombosis' to query.
Server submits newly expanded query 'dvt deep vein thrombosis remediation procedures' to search engine.
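A minimal sketch of that expansion step, with an obviously hypothetical synonym map that you would fill from your own medical vocabulary (the logic is language-agnostic; Java is used here for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SynonymExpander {
    // Hand-maintained synonym list (hypothetical); multi-word expansions are
    // quoted so the search engine treats them as phrases.
    private static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("dvt", "\"deep vein thrombosis\"");
    }

    // Adds the expansion of every known term to the query. Matching a
    // multi-word phrase back to its abbreviation would additionally require
    // scanning the query for n-grams, not just single terms.
    public static String expand(String query) {
        StringJoiner expanded = new StringJoiner(" ");
        for (String term : query.toLowerCase().split("\\s+")) {
            expanded.add(term);
            String synonym = SYNONYMS.get(term);
            if (synonym != null) {
                expanded.add(synonym);
            }
        }
        return expanded.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("dvt remediation procedures"));
        // -> dvt "deep vein thrombosis" remediation procedures
    }
}
```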
Finally, if the stemmer built into Sphinx is doing its job, you shouldn't have to support both 'venous' and 'vein' as separate terms since they both should stem to the same term. If this is not the case, you might need to do additional pre-stemming to handle words specific to your corpora (medical terms).