Creating an ElasticSearch index mixing two CouchDB rivers - couchdb

I am trying to index two different CouchDB databases with Elastic Search, using a single index, but it seems that only one of the two databases is actually indexed. The documents from the other database are simply not included in the index.
Is there a limit on the number of rivers that can be connected to a given index?
I am unable to find any documentation about this specific use case ...

You should create two indexes, with one river for each.
AFAIK you cannot have multiple rivers for the same index. Check the content of your _river index.
BTW, questions like this are best asked on the elasticsearch.org mailing list.

Related

Finding the number of documents that contain a term in elasticsearch

I have an Elasticsearch index that contains around 2.5 billion documents with around 18 million different terms in an analyzed field. Is it possible to quickly get a count of the number of documents that contain a term without searching the index?
It seems like ES would store that information while analyzing the field, or perhaps be able to count the length of an inverted index. If there is a way to search for multiple terms and get the document frequency for each of the terms, that would be even better. I want to do this thousands of times on a regular basis, and I can't tell if there is an efficient way to do that.
You can use the Count API to just return the count from a query, instead of a full document listing.
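As a rough sketch of what a Count API request for a single term might look like, the following Python builds the request path and JSON body without sending anything. The index name `docs`, the field name `body`, and the helper `build_count_request` are hypothetical, and the body shape assumes a reasonably recent Elasticsearch version.

```python
import json


def build_count_request(index, field, term):
    """Return (path, body) for an Elasticsearch _count request on one term."""
    path = "/{}/_count".format(index)
    # A term query matches the exact token stored in the inverted index,
    # so for an analyzed field pass the term as it looks after analysis
    # (for example, lowercased).
    body = {"query": {"term": {field: term}}}
    return path, json.dumps(body)


path, body = build_count_request("docs", "body", "lucene")
print(path)
print(body)
```

The response to such a request reports the number of matching documents in its `count` field, so no document listing is transferred.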
As for whether Elasticsearch gives you a way to do this without a query: I'm reasonably confident Elasticsearch doesn't keep that information anywhere outside the index, because that is exactly what a Lucene index already does. That's what an inverted index is: a map of documents indexed by term. Lucene is designed around making it efficient to look up documents by term.
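To make the inverted index idea concrete, here is a toy version in Python. It is only an illustration with naive whitespace tokenization, not how Lucene stores its postings; the sample documents and the helper `build_inverted_index` are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    1: "Lucene builds an inverted index",
    2: "Elasticsearch is built on Lucene",
    3: "An index maps terms to documents",
}
index = build_inverted_index(docs)

# The document frequency of a term is the size of its posting set.
for term in ("lucene", "index", "solr"):
    print(term, len(index.get(term, ())))
```

With this structure, the document frequency of a term is a lookup plus a length, which is why counting by term is cheap for an engine built on an inverted index.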

full text search in databases

I have two fairly general questions about full text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into whoosh search, which does index table columns and the result of whoosh are actual table rows.
When using solr or elastic search, should I put the row id into the document which gets searched and after I have my result use that id to retrieve the relevant rows from the table? Or is there a better solution?
My other question: if I have an ID like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with full text search? It seems to me there is not much to be gained by indexing it. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still keep the original data in a regular database, which gives you more reliability and flexibility when reindexing. Keep in mind that ES indexes non-relational data. You can store your data relationally and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search and so on. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full text search is quite different from relational database standard storage. So, your data going into Search Engine can end up looking quite differently from the way you stored it.
This is all driven by your expected search results. You may increase granularity of the data or - opposite - denormalize it so the parent/related record content shows up in the records you actually want returned as part of search. Text processing (copyField, tokenization, pre-processing, etc) is also where a lot of content modifications happen to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right (relevant) IDs out, and then merge them in the client code with the details from the original database records.
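The "IDs from the search engine, details from the database" pattern could look roughly like the sketch below. It uses an in-memory SQLite table as a stand-in for the real database; the `articles` table, the helper `fetch_rows_in_hit_order`, and the list of hit IDs are all hypothetical.

```python
import sqlite3


def fetch_rows_in_hit_order(conn, hit_ids):
    """Load rows for the given ids and return them in search-result order."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" for _ in hit_ids)
    rows = conn.execute(
        "SELECT id, title, body FROM articles WHERE id IN ({})".format(placeholders),
        hit_ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Keep the relevance order from the search engine and skip ids that are
    # no longer in the database (the index may be slightly stale).
    return [by_id[i] for i in hit_ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "Rivers", "..."), (2, "Facets", "..."), (3, "Scoring", "...")],
)

# Hypothetical ids returned by the search engine, best match first.
hit_ids = [3, 1, 99]
for row in fetch_rows_in_hit_order(conn, hit_ids):
    print(row)
```

Re-sorting by the hit list matters because an `IN (...)` query does not preserve the relevance order the search engine produced.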

Difference between Elastic Search and Google Search Appliance page ranking

How does page ranking in Elasticsearch work? Once we create an index, is there an underlying intelligent layer that creates a metadata repository and returns results ordered by relevance? I have created several indices and I want to know how the results are ordered once a query is provided. Also, is there a way to influence these results based on relationships between different records?
Do you mean how documents are scored in elasticsearch? Or are you talking about the 'page-rank' in elasticsearch?
Documents are scored based on how well the query matches the document. This approach is based on the TF-IDF concept (term frequency-inverse document frequency). There is, however, no 'page-rank' in elasticsearch. 'Page-rank' takes into consideration how many documents point towards a given document. A document with many in-links is weighted higher than others. This is meant to reflect whether a document is authoritative or not.
In elasticsearch, however, the relations between the documents are not taken into account when it comes to scoring.
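To illustrate the TF-IDF idea mentioned above, here is a textbook-style sketch in Python. It is deliberately simplified and is not Lucene's actual scoring formula, which adds further factors such as length normalization; the sample documents and the helper `tf_idf` are made up.

```python
import math


def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)


docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the fox jumps over the dog".split(),
]

# "the" appears in every document, so idf = log(3/3) = 0 and it scores 0.
# "quick" appears in only one document, so it scores higher there.
for term in ("the", "fox", "quick"):
    print(term, [round(tf_idf(term, d, docs), 3) for d in docs])
```

Note that the score depends only on the query term and the text of each document; nothing in it looks at links or other relations between documents.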

Is it possible to have documents with a subset of fields of the collection's schema under one solr collection?

We have 4 different data sets and want to perform faceted search on them.
We are currently using SolrCloud and flattened these data sets before indexing them to Solr. Even though we have relational data, our primary goal is faceted search and Solr seemed like the right option.
Rough structure of our data:
Dataset1(col1, col2, col3,col4)
Dataset2(col1,col6,col7,col8)
Dataset3(col6,col9,col10)
Flattened dataset: dataset(col1,col2,col3,col4,col6,col7,col8,col9,col10).
In the end, we flattened them to have one common structure and have nulls where values do not exist. So far Solr works great.
Problem: Now we have additional data sets coming in, and each of them has about 50-60 columns. Technically, I can still flatten these too, but I don't think it is a good idea. I know that I can have different collections with different schemas for each data set. But we perform group-by operations on these documents, so we need one schema.
Is there any way to maintain documents with a subset of fields of the schema under one collection without flattening them? If not, is there a better solution for this problem?
For instance:
DocA(field1, field2) DocB(field3,field4).
Schema(field1, field2, field3, field4).
Can we have DocA and DocB under one collection with the above schema?
Our backend is on top of Cloudera Hadoop (CDH4.6 and 5.2) distribution and we can choose any tool that belongs to the Hadoop ecosystem for a possible solution.
Of course you can. Each document only needs its own value for the uniqueKey field. If you have defined a fixed Solr schema, dynamic fields (dynamicField) may help you.
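As a small sketch of the idea, the Python below builds documents that carry only the fields they actually have, instead of padding them with nulls. It assumes the uniqueKey field is called `id` and that none of the other schema fields are marked as required; the helper `to_sparse_doc` and the sample values are hypothetical.

```python
import json


def to_sparse_doc(doc_id, fields):
    """Build a Solr document that only carries the fields that have a value."""
    doc = {"id": doc_id}
    doc.update({name: value for name, value in fields.items() if value is not None})
    return doc


# DocA and DocB share one schema (field1..field4) but each uses a subset.
doc_a = to_sparse_doc("A-1", {"field1": "x", "field2": "y", "field3": None, "field4": None})
doc_b = to_sparse_doc("B-1", {"field3": 10, "field4": 20})

# A JSON array of documents is one of the formats Solr's update handler accepts.
payload = json.dumps([doc_a, doc_b])
print(payload)
```

Both documents can then live in the same collection, and faceting or grouping on a field simply ignores documents that do not have it.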

Sphinx Search Multiple Indexes & Sources

I'm making a dynamic CMS, so every instance of the CMS will have its own tables in one MySQL DB. So far everything is working.
The environment:
8 different sites with different content. They share only the DB name, but all have different tables ($sitename_posts)
Search engine: Sphinx
Now I'm stuck at this: when, for example, a user makes a search on site 1, I want to search all the $sitename_posts tables and return the best results.
As the search engine I use Sphinx. I have tried it with two sources and two indexes, but when I search, for example:
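$sphinx = new SphinxClient;
$sphinx->setServer($sphinx_host, $sphinx_port);
$sphinx->setMatchMode(SPH_MATCH_ANY);
$sphinx->setMaxQueryTime(10000);
// Extended sort expressions refer to internal attributes with "@", e.g. @relevance.
$sphinx->setSortMode(SPH_SORT_EXTENDED, '@relevance DESC');
$sphinx->setLimits(0, 100, 300);
// The second argument is a string of index names separated by spaces.
$result = $sphinx->query("Hello World", "index1 index2");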
I get no results. But if I build only one index with multiple sources, I get results, but I can't identify which source the data comes from, so I can't tell which site the content belongs to.
One more question: when I search the indexes, is it possible for Sphinx to return the ID together with the index that ID belongs to? I need to identify which index each result belongs to.
Thanks for help!
If I understand the question correctly, it's worth looking into the following Sphinx features:
Distributed Indexes - This would allow you to have one index per site and also have a "virtual" distributed index which you could search from the application when you want to get data.
Index Merging - This is more permanent than the distributed index option but the indexer is able to merge multiple indexes into a single index. I would usually prefer to use distributed indexes.
Attributes - This would allow you to include a constant value in each of the indexes (e.g. siteId) which would allow you to identify which of the indexes the search result came from. It could also allow you to filter results when searching from the single distributed index.
Sphinx docs - http://sphinxsearch.com/docs/2.0.1/
Distributed indexes explained - http://sphinxsearch.com/docs/2.0.1/distributed.html
Configuring distributed indexes - http://sphinxsearch.com/docs/2.0.1/confgroup-index.html
