Sunspot and Nutch with Solr

Sunspot and Nutch with Solr - search

I'm a little confused about how to use Sunspot with Solr.
I have worked with Solr and Nutch. Say I have a Solr instance with all my data already indexed by Nutch, and I now want to configure Sunspot only to search that data, not to index into it.
I have been investigating, but everything I have found is about configuring models first in order to index into Solr.
Given a document like the one below, how can I configure controllers in Ruby on Rails with Sunspot to get these results from my Solr instance?
{
"content": "Varadero es la mejor playa de Cuba, recientemente se remodelo",
"title": "Varadero ampliada la Internet",
"segment": "20131114152100",
"boost": 1,
"digest": "e6cc9412d5066dae9e176fd7bc598913",
"tstamp": "2013-11-14T15:47:55.235Z",
"id": "http://blogs.uclv.edu.cu/blog/1039#main-content",
"url": "http://blogs.uclv.edu.cu/blog/1039#main-content",
"_version_": 1451712964205740000
}
Thanks in advance for your time

Sunspot is designed to index/search ActiveRecord models. It adds a search method to the models.
If you are not using ActiveRecord or some other database persistence for these items in your Rails application, you should just use the rsolr gem directly, not Sunspot.
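Whichever client you pick, a search-only setup comes down to an HTTP GET against Solr's /select handler. Here is a minimal sketch of that request, written in Python only to show its shape. The host, port and the core name collection1 are assumptions; the field names (content, title, url, id) come from the document in the question.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, text, rows=10):
    """Build a read-only query URL for Solr's standard /select handler."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'content:"{escaped}"',  # search the Nutch-indexed "content" field
        "fl": "id,title,url",         # return only these stored fields
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance and core name.
url = solr_select_url("http://localhost:8983/solr", "collection1", "Varadero")
print(url)
```

Nothing here writes to the index, so the data indexed by Nutch stays untouched.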

Related

Cassandra way to create and query nested list of objects

I have a model of posts and their corresponding comments, like this:
{
"id": "1234",
"moment": "2021-02-19T10:00:00Z",
"body": "Good morning!",
"author": "Bob",
"comments": [
{
"body": "Take care!",
"moment": "2021-02-19T11:13:00Z",
"author": "Bob"
},
{
"body": "Hey there!",
"moment": "2021-02-19T11:15:00Z",
"author": "Maria"
}
]
}
Using Cassandra 3.11.10, I managed to create a table and run a case-insensitive LIKE search on text contained in the post body:
CREATE TABLE post(
id uuid,
moment timestamp,
body text,
author varchar,
PRIMARY KEY (id)
);
CREATE CUSTOM INDEX body_idx ON post (body) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer','case_sensitive': 'false'};
INSERT INTO post (id, moment, body, author) VALUES (uuid(), '2021-02-19T10:00:00Z', 'Good morning!', 'Bob');
SELECT * FROM post WHERE body LIKE '%morning%';
But how can I create a table structure for nested comments, and also search text in both post and comments bodies?

First, please keep in mind that data modeling in Cassandra is different from modeling in a relational database: in NoSQL, and in Cassandra especially, denormalization is your friend.
You need to focus on the data you want to retrieve from the database, that is, do a query-driven design.
However, if you cannot remodel the system, there are a few solutions:
Add a search engine such as Solr or Elasticsearch alongside Cassandra: this solution introduces a second service that handles all of the searching.
Pro: You can keep the Cassandra model similar to a relational one.
Cons: Maintaining two services adds operational complexity, and you have to manage the data in two different stores and keep them synchronized.
Use Stratio: a plugin that adds a Lucene index to Cassandra.
Pro: It provides a full-text search engine integrated with Cassandra. You don't need a new infrastructure service, and you don't have to worry about replicating data between services. It supports UDTs, so you can define the comments as a UDT and search them as usual.
Cons: You need to install a jar on each Cassandra node. Besides, because Cassandra and Lucene run on the same machine, it might impact performance, so look at the references first.
Cassandra DSE (DataStax Enterprise): a commercial version that has several extra features, such as search integration.
Pro: A search engine integrated with several features.
Cons: There is no free version.
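To make the query-driven idea concrete, here is a minimal sketch of one possible denormalization. It is an illustration, not one of the options listed above: the post and its comments are flattened into one row per body, so a single SASI index on body could serve both. The row layout (post_id, kind, author, moment, body) and the idea of a separate search table are assumptions.

```python
def flatten_post(post):
    """Yield one row per searchable body: the post itself, then each comment."""
    post_id = post["id"]
    yield {
        "post_id": post_id,
        "kind": "post",
        "author": post["author"],
        "moment": post["moment"],
        "body": post["body"],
    }
    for comment in post.get("comments", []):
        yield {
            "post_id": post_id,   # every comment row points back to its post
            "kind": "comment",
            "author": comment["author"],
            "moment": comment["moment"],
            "body": comment["body"],
        }


# The example document from the question.
post = {
    "id": "1234",
    "moment": "2021-02-19T10:00:00Z",
    "body": "Good morning!",
    "author": "Bob",
    "comments": [
        {"body": "Take care!", "moment": "2021-02-19T11:13:00Z", "author": "Bob"},
        {"body": "Hey there!", "moment": "2021-02-19T11:15:00Z", "author": "Maria"},
    ],
}

rows = list(flatten_post(post))
for row in rows:
    print(row["kind"], row["body"])
```

Each row would be inserted into the hypothetical search table. A LIKE query on body would then return matching posts and comments alike, with post_id identifying the post to load.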

How do multi-language search engines work?

Today, while searching for some videos on YouTube, I found that YouTube can return relevant results even if you search for videos in languages other than English.
I tried searching for this on Google, but all I got was some APIs for doing it programmatically. Can someone shed some light on the theory behind this? Papers, links, explanations: anything would do.
Thanks

When I've done this with Elasticsearch, I've simply mapped multiple language-specific sub-fields for each text field, like:
"text_val": {
"type": "text",
"fields": {
"en": {
"type": "text",
"analyzer": "english"
},
"it": {
"type": "text",
"analyzer": "italian"
}
}
}
And then just search both fields for every query. This works well and is good enough for many applications. However, I'm sure Google is doing something much more complex, certainly including language identification on both the indexed documents and the query. In case you want to do language identification, I've used the Python langid library before and had good results.
The problem you're going to face using Elasticsearch for this kind of thing, in my experience, isn't the multi-language part, but that the analyzers for languages other than English don't always work as well as you would like. You may have to write a custom analyzer, with rules to handle lots of special cases, and tuned for your specific dataset.
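As a rough illustration of "search both fields for every query", here is a minimal sketch of a request body that targets the two sub-fields from the mapping above. It assumes a multi_match query of type most_fields; the answer does not say which query type was actually used, and the helper name is hypothetical.

```python
import json


def multilang_query(user_text, base_field="text_val", languages=("en", "it")):
    """Build a search body that queries every language sub-field at once."""
    fields = [f"{base_field}.{lang}" for lang in languages]
    return {
        "query": {
            "multi_match": {
                "query": user_text,
                "fields": fields,
                # most_fields combines the scores of all matching sub-fields
                "type": "most_fields",
            }
        }
    }


body = multilang_query("buongiorno")
print(json.dumps(body, indent=2))
```

The resulting dictionary is what you would send as the JSON body of a _search request.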

Azure Search: Is there support for conjugation in the French or any language analyzer?

I am facing a business requirement for the French language that conjugation must be supported. For example, if the user searches for "Être" then it should also find variations of the form of the verb (voice, mood, tense, etc).
Based on what I have seen, the Azure Search fr.microsoft analyzer (or a custom analyzer built on top of it) supports this. I have verified this by searching for "Être" and finding documents with: est, EST, sera, sont and etre.
It does not, however, find documents with the following: ete, etes, Ete, Etes.
I searched and found this page which documents the simple and compound forms of Être.
http://conjugator.reverso.net/conjugation-french-verb-%C3%AAtre.html
It does not look like the Microsoft French language analyzer supports all of them. Is this true? If so, then how do I ensure all are handled? Do I need to add "ete" and "etes" as synonyms for "Être"? If so, would I also need to add "Ete" and "Etes" as synonyms for "Être" as well?
Is there a way for me to get documentation on all the French conjugation support in Azure Search?
Last but not least, how do I better understand ALL the conjugation for "Être"? I tried using the Analyzer API...
{ "analyzer": "fr.microsoft", "text": "Être" }
But I only get the following response:
{
"@odata.context": "https://one-adscope-search-poc2.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
"tokens": [
{
"token": "etre",
"startOffset": 0,
"endOffset": 4,
"position": 0
},
{
"token": "être",
"startOffset": 0,
"endOffset": 4,
"position": 0
}
]
}

In Azure Search, our linguistic analyzers use normalized forms to match different conjugations of the word. For example, at indexing time, the Microsoft analyzer analyzes the word 'sont' to 'etre' and indexes both the original and the normalized/lemmatized form of the word. At query time, say you are issuing a search query with 'est'. The word 'est' also analyzes to 'etre' and finds the document containing 'sont'. The responses from the Analyze API you shared align with this expectation.
Unfortunately, we don't provide an exhaustive list of conjugations in our documentation. You may be able to generate the list by running a sample of your documents through the Analyze API and collecting the responses.
Finally, you can use our synonyms feature to fill in the gap. I noticed that the words that are not matching (ete, etes, Ete, Etes) all analyze to the base form 'ete'. You can define a synonym rule that says 'etre' and 'ete' are equivalent. The synonyms feature is currently in private preview. Feel free to reach out to me at nateko AT microsoft if you want to try it out.
Hope this helps.
Nate
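The suggestion to build the list from Analyze API responses could be sketched roughly as follows. The code does not call the service. It takes a mapping from each word to the tokens the analyzer returned for it, and groups words that share a token. The sample mapping is transcribed from the statements in this thread (Être gives etre and être; sont and est give etre; ete and etes give ete), not from fresh API output, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_token(analyzed):
    """Group surface forms by the tokens the analyzer produced for them."""
    groups = defaultdict(set)
    for word, tokens in analyzed.items():
        for token in tokens:
            groups[token].add(word)
    return {token: sorted(words) for token, words in groups.items()}


# Transcribed from this thread; replace with real Analyze API results.
analyzed = {
    "Être": ["etre", "être"],
    "sont": ["etre"],
    "est": ["etre"],
    "ete": ["ete"],
    "etes": ["ete"],
}

groups = group_by_token(analyzed)
for token, words in groups.items():
    print(token, "<-", ", ".join(words))
```

Words that land in different groups, here the 'etre' and 'ete' groups, are the candidates for a synonym rule.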

Google Freebase Search API Alternative?

Google deprecated their Freebase Search API and is transferring things over to Wikidata. However, there appears to be no replacement for the Freebase Search API (https://developers.google.com/freebase/v1/search-overview), which supported:
Autosuggesting entities (e.g. Freebase Suggest Widget)
Getting a ranked list of the most notable entities with a given name.
Finding entities using Search Metaschema.
Moreover, it would also accept malformed strings and correct them, and return nice detailed relevancy rankings along with the associated Freebase topic id. I can't find anything in their Custom Search API that returns any information relevant to their knowledge graph, or to any other.
Ideally I would like something that I can query in a similar way and that returns a result like the Freebase API used to.
For example, a query of "Nirvana" in the Freebase Search API would return:
{
"status":"200 OK",
"result":[
{
"mid":"/m/0b1zz",
"name":"Nirvana",
"notable":{"name":"Record Producer","id":"/music/producer"},
"score":55.227268
},{
"mid":"/m/05b3c",
"name":"Nirvana",
"notable":{"name":"Belief","id":"/religion/belief"},
"score":44.248726
},{
"mid":"/m/01h89tx",
"name":"Nirvana",
"notable":{"name":"Musical Album","id":"/music/album"},
"score":30.371510
},{
"mid":"/m/01rn9fm",
"name":"Nirvana",
"notable":{"name":"Musical Group","id":"/music/musical_group"},
"score":30.092449
},{
"mid":"/m/02_6qh",
"name":"Nirvana",
"notable":{"name":"Film","id":"/film/film"},
"score":29.003593
},{
"mid":"/m/01rkx5",
"name":"Nirvana Sutra",
"score":21.344824
}
],
"cost":10,
"hits":0
}
Note the relevance score and the Freebase mid.
Essentially, are there any alternatives out there, either open source or commercial, that replace this much-needed functionality?

I've used the Prismatic Interest graph API for somewhat similar functionality. My use-case was a bit different (tagging documents with topics) but looking at their API endpoints you might be able to duplicate the functionality you described above with a query to topic/search (search for topics that match a search string) and a query to topic/topic to search for similar topics (sorted by score).
Edit
As David notes in the comments below, the Prismatic Interest Graph API has been discontinued.
Also, the Google Knowledge Graph Search API now seems to be the intended replacement for the Freebase Search API.

How about the Google Knowledge Graph Search API? There is also a web application exposing the API.
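For orientation, here is a minimal sketch of what a request to that API looks like. It assumes the entities:search endpoint and its query, limit and key parameters. The API key is a placeholder and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Knowledge Graph Search API.
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def kg_search_url(query, api_key, limit=5):
    """Build a Knowledge Graph Search API request URL for a free-text query."""
    params = {"query": query, "limit": limit, "key": api_key}
    return f"{KG_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder; a real key comes from the Google API console.
url = kg_search_url("Nirvana", "YOUR_API_KEY")
print(url)
```

The response is JSON-LD. As far as I know, each entry carries a relevance score and an entity id, which is roughly the ranking plus mid pair the question asks for.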

The :BaseKB project offers Freebase data (plus some other data) as RDF. :BaseKB's data can be downloaded for free or easily run on an AWS instance for live queries. The AWS machine image contains a Virtuoso database so you can query it with the SPARQL query language.

couchdb match multiple inconsistent keys

Considering the following two documents:
{
"_id": "a6b8d3d7e2d61c97f4285220c103abca",
"_rev": "7-ad8c3eaaab2d4abfa01abe36a74da171",
"File":"/store/document/scan_bgd123.jpg",
"Commend": "Describes a person",
"DateAdded": "2014-07-17T14:13:00Z",
"Name": "Joe",
"LastName": "Soap",
"Height": "192cm",
"Age": "25"
}
{
"_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
"_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
"File":"/store/document/scan_adf123.jpg",
"Comment": "Describes a car",
"Make": "Ford",
"Year": "2011",
"Model": "Focus",
"Color": "Blue"
}
How would I find a document based on multiple criteria, for example "Make"="Ford" and "Color"="Blue"? I realize I need a view for this, but I don't know what the key is going to be, and as you can see from the two documents, the key/value pairs aren't consistent. The only consistent item will be the "File" key.
I'm attempting to create a CouchDB database that will store the location of files, tagged with key/value pairs.
EDIT:
Perhaps I should reconsider my data structure and modify it slightly?
{
"_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
"_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
"File": "/store/document/scan_adf123.jpg",
"Tags": {
"Comment": "Describes a car",
"Make": "Ford",
"Year": "2011",
"Model": "Focus",
"Color": "Blue"
}
}
So, I need to find documents by a key/value pair in the tags, or by any number of key/value pairs, to filter which document I want. The problem here is that I want to tag objects with key/value pairs. These tags could be very different per view, so the next document will have a whole different set of key/value pairs.

CouchDB supports flexible schemas. There is no need for the documents to be consistent for them to be queryable. The view for your scenario is pretty straightforward. Here is a map function that should do the trick.
function (doc) {
  // Index only documents that carry both fields; the key is the [Make, Color] pair.
  if (doc.Make && doc.Color) {
    emit([doc.Make, doc.Color], null);
  }
}
This gives you a view which you can then query like
/view-name?key=["Ford","Blue"]&include_docs=true
This should give you the desired result.
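In a real request the key has to be JSON-encoded and then URL-escaped. Here is a minimal sketch of building the full view URL. The host, the database name files, the design document name tags and the view name by_make_color are all assumptions.

```python
import json
from urllib.parse import urlencode


def view_url(base, db, ddoc, view, key, include_docs=True):
    """Build a CouchDB view query URL with a JSON-encoded, URL-escaped key."""
    # CouchDB expects the key parameter as JSON, e.g. ["Ford","Blue"].
    params = {"key": json.dumps(key, separators=(",", ":"))}
    if include_docs:
        params["include_docs"] = "true"
    return f"{base}/{db}/_design/{ddoc}/_view/{view}?{urlencode(params)}"


# Hypothetical database, design document and view names.
url = view_url("http://localhost:5984", "files", "tags", "by_make_color",
               ["Ford", "Blue"])
print(url)
```

A GET on that URL returns the matching rows, and include_docs=true attaches the full document to each row.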
Edit based on comment
For that you will need two separate views. Every view in couchdb is designed to fulfil a specific query need. This means that you have to think about access strategy of your data. It is a lot more work on your part initially but for the trouble you are rewarded with data that is indexed and has very fast access times.
So, to answer your question directly: create two views, one for Make, like we have already done, and another for Name, like
function (doc) {
  // Index only documents that carry both name fields; the key is the [Name, LastName] pair.
  if (doc.Name && doc.LastName) {
    emit([doc.Name, doc.LastName], null);
  }
}
Now the Name view will index only those documents that have a name in them, whereas the Make view will index those documents that have a make in them.
What happens when a requirement comes up in the future for which you don't have a view?
You can try a few things.
1. This is probably the easiest solution: use couchdb-lucene for your dynamic queries. In this case your architecture will be CouchDB views for the queries that you know your application needs, and a Lucene index for the queries that you don't yet know you might need. For instance, say you have indexed name and last name in a CouchDB view, but a requirement arises to query by age: simply add the age field to the Lucene index and it will take care of the rest.
2. Another approach is the PPP technique, where you exploit the fact that creating a view is a one-time cost: you can build views during less active hours and deploy them to the production service once they are built.
3. Combine steps 1 and 2: use Lucene to handle ad hoc requests while you are building views using the PPP technique.
