Scalability of Solr and ElasticSearch: fields of 5000 values

I need to send records to a search engine (Solr or ElasticSearch) to index.
In my design, a field can have up to 5000 values, and for some records ALL 5000 values of this field (in an OR or AND relationship) need to be sent to the search engine.
I have about 10 fields of this nature, plus 30 other fields (text, integer, etc.).
I wonder whether Solr or ElasticSearch can effectively handle a large number of values of a field and which one does a better job.
What about millions of records in this situation?
What about real-time indexing when there are already millions of records and the count keeps growing? I understand Solr NRT (near real-time search) and ElasticSearch can do real-time indexing, but I am not sure whether my situation poses new challenges.
Thanks for any input!
Cheers!

Both Solr and ElasticSearch are based on Lucene, which does the actual indexing, querying, and storing of documents. So performance, in terms of field and document size, should be pretty similar in both.
The choice between one or the other should probably be based on which one you find most enjoyable to work with. ElasticSearch, for example, has a JSON API for querying and indexing, while Solr uses pretty much XML for configuration and querying.
If you're going to have millions of documents and/or need to spread the indexing/query load across a cluster of machines, ElasticSearch has, in my opinion, an advantage because of how easy it is to shard and create replicas.
Regarding real-time search, both will probably suit your needs. Both let you configure how frequently the index is "refreshed", which controls how soon newly indexed documents appear in search results. For example, in ElasticSearch you can set the refresh to occur once a minute.
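For illustration, here is a minimal sketch in Java (using the JDK's built-in java.net.http client, so Java 11+, against an assumed local ElasticSearch node; the index name and settings values are made up) that creates an index with explicit shard/replica counts and a one-minute refresh interval:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        // Index settings: 5 primary shards, 1 replica of each, and a 60s
        // refresh interval so newly indexed documents become searchable
        // about once a minute.
        String body = "{"
                + "\"settings\": {"
                + "  \"number_of_shards\": 5,"
                + "  \"number_of_replicas\": 1,"
                + "  \"refresh_interval\": \"60s\""
                + "}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/myindex")) // assumed local node and index name
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}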

Related

SOLR IDF Max docs configuration

I'm using Solr for storing the documents used by search in my application. The Solr instance is shared by multiple applications, and the data is grouped by an application id that is unique for each application.
For calculating TF-IDF scores, Solr uses the total number of documents available in it. How do I change that configuration so that IDF is based only on the total documents available for the application id, rather than counting all the documents across applications?
Even if you store all docs in one collection, there is still something you can do!
Unless you enable ExactStatsCache in your solrconfig.xml like this:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
similarity calculations are per shard, not across the whole collection.
So, if you shard your docs by your application_id, you will get "better" scores, closer to what you want. It will be exactly what you want if you have one application_id per shard, but if you have many applications and not many shards, you will get more than one app per shard.
If you store them in one collection, I am afraid it's not possible with built-in functionality.
I think you have several choices: store each application's data in a separate collection; then you will get IDF based only on that specific application's data out of the box.
If this is not suitable for you, you will need to write your own Similarity, probably by extending https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html and overriding the method public abstract float idf(long docFreq, long docCount), which is responsible for calculating IDF.
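If you go the custom-Similarity route, a minimal sketch could look like the following. It extends ClassicSimilarity (a concrete TFIDFSimilarity subclass, so you don't have to implement every abstract method) and assumes you can supply the per-application document count yourself; the class name and constructor are hypothetical:

import org.apache.lucene.search.similarities.ClassicSimilarity;

// Hypothetical Similarity that computes IDF against a per-application
// document count instead of the shard-wide count Lucene passes in.
public class PerApplicationSimilarity extends ClassicSimilarity {

    private final long appDocCount; // total docs for one application_id, supplied by you

    public PerApplicationSimilarity(long appDocCount) {
        this.appDocCount = appDocCount;
    }

    @Override
    public float idf(long docFreq, long docCount) {
        // Same shape as the classic Lucene IDF formula, with docCount
        // replaced by the per-application total.
        return (float) (Math.log((appDocCount + 1) / (double) (docFreq + 1)) + 1.0);
    }
}

Note that docFreq is still whatever Lucene sees (per shard, or per collection with ExactStatsCache), so this only approximates a true per-application IDF.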
Overall, I think the first approach will suit your needs much better.

full text search in databases

I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of a Whoosh search are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets searched, and after I have my result, use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; it gives you more reliability and flexibility for reindexing. Mind that ES indexes non-relational data. You can have your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664" you can index it as tokenized string or you can tune the index for prefix search etc. It's up to you
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching.
Content storage for good full-text search is quite different from standard relational database storage. So, the data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or, at the opposite extreme, denormalize it so the parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right, relevant IDs out and then merge them in the client code with the details from the original database records.
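As a rough sketch of that last pattern (plain JDBC; the table, column, and variable names are invented), the client code takes the IDs the search engine returned and fetches the full rows in one query:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;
import java.util.List;

public class SearchThenFetch {

    // ids: the relevant row IDs returned by the search engine, in rank order.
    static void printMatchingRows(String jdbcUrl, List<Long> ids) throws Exception {
        // Build "?,?,?" with one placeholder per id.
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        String sql = "SELECT id, title, body FROM documents WHERE id IN (" + placeholders + ")";

        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < ids.size(); i++) {
                ps.setLong(i + 1, ids.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                // Note: SQL will not preserve the search engine's ranking;
                // re-sort by the original id order if relevance order matters.
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("title"));
                }
            }
        }
    }
}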

Cassandra store and query dynamic (user defined) data

We've been looking into using Cassandra to store some of the larger data in a multi-tenant system we are building. The decision to use Cassandra is mostly to do with scaling capabilities and performance when working with large data sets, but I am not sure whether what we're looking for is possible in Cassandra, so I'm hoping someone has some clues as to whether (and how) this could be done:
We are looking for a way to let our users first define their own entity types and then define fields in those entities (and field types). Once they've defined this, their data (matching the definitions they just created) could be imported, stored, and, most importantly, queried by pretty much any field they defined.
So for instance, we may have one user who defines an Airplane, which has the manufacturer name, model, tail number, year of production, etc...
Their data will, then, contain those fields, be searchable and sortable by those fields, etc..
Another user may decide to define a Boat, which can then have different fields, which should be also sortable and searchable by content.
Because of the possible number of entries, the typical relational approach is unlikely to yield adequate performance, so we're looking at a NoSQL approach.
Is this something that could be done in C*? Or are there any other suggestions in terms of a storage engine that would offer best flexibility?
I can see two important points in your requirements:
Dynamic typing/schemaless data: Cassandra defines how data is structured up front, like a relational database, although you can use columns of complex types such as map.
Query by any field: Cassandra requires each query to provide the partition key. The Cassandra data model is driven by your queries; if you don't know your queries in advance, you won't be able to design an appropriate model, and you won't be able to query it (see the sketch below).
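To illustrate the second point, here is a minimal sketch with the DataStax Java driver (version 4.x; the keyspace and table are invented, the keyspace is assumed to already exist, and a map column is only a crude stand-in for user-defined fields):

import com.datastax.oss.driver.api.core.CqlSession;

public class PartitionKeyExample {
    public static void main(String[] args) {
        // With no explicit contact points, the driver connects to a
        // local node at 127.0.0.1:9042 (default settings assumed).
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "CREATE TABLE IF NOT EXISTS tenant_ks.entities ("
                + "  entity_type text,"
                + "  id uuid,"
                + "  fields map<text, text>,"  // crude stand-in for user-defined fields
                + "  PRIMARY KEY (entity_type, id))");

            // Fine: the partition key (entity_type) is constrained.
            session.execute(
                "SELECT * FROM tenant_ks.entities WHERE entity_type = 'Airplane'");

            // Rejected by Cassandra unless you add ALLOW FILTERING (and slow
            // even then): arbitrary-field predicates are exactly what the
            // Cassandra data model is not built for.
            // session.execute(
            //     "SELECT * FROM tenant_ks.entities WHERE fields CONTAINS '747'");
        }
    }
}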
I advise you to have a look at Elasticsearch.
Then, if you have to use Cassandra for some other reason, I advise you to look at the DataStax Enterprise edition of Cassandra, which integrates with Solr and Spark: both will give you extra querying capabilities.

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene, and other content, such as Word documents, would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
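For the single-index approach, the usual pattern is to index a user id field on every document and attach a mandatory filter clause to every query before it runs. A minimal sketch against a modern Lucene API (the "userId" field name is an assumption):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerUserQuery {

    // Wraps the user's ad-hoc query so it can only ever match
    // documents owned by that user.
    static Query restrictToUser(Query userQuery, String userId) {
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("userId", userId)),
                     BooleanClause.Occur.FILTER) // mandatory, non-scoring clause
                .build();
    }
}

The key discipline is that the filter is applied server-side on every query path, so a malformed user query can never widen the result set.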
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you use several machines for indexing, I guess it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you would want to split indexes:
If your machine has several IO devices that can be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is designed for). This is a more exotic form of splitting, but it may be a good idea if searches use different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I guess name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This would give the system more overall throughput.

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that back in the results, we will fire a second search to fetch the project meta data.
This way we avoid duplicating data. Also, if we want to update the project meta data, we end up updating it in a SINGLE PLACE only. Otherwise, if we stored this meta data with every pdf document's index entry, we would end up updating all of the documents, which is not what I am looking for.
please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time, for example, "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row seems like a fine thing to do.
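A minimal sketch of that indexing scheme (modern Lucene API; the field names are assumptions): the pdf text is tokenized for full-text search, while the project id is kept as a single untokenized, stored term that you can read back from a hit and use as a database key:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ProjectDocumentIndexer {

    static void indexPdf(IndexWriter writer, String pdfText, String projectId)
            throws IOException {
        Document doc = new Document();
        // Full text of the pdf: tokenized and searchable, not stored.
        doc.add(new TextField("content", pdfText, Field.Store.NO));
        // Project reference: one untokenized term, stored so it can be
        // read back from search hits and used to look up the metadata.
        doc.add(new StringField("projectId", projectId, Field.Store.YES));
        writer.addDocument(doc);
    }
}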
I am not sure about your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a full-text search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents contain just the searchable data.
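With Hibernate Search, that split looks roughly like this (a sketch using the Hibernate Search 5 annotations; the entity and its fields are invented): properties annotated with @Field are mirrored into a Lucene index that Hibernate Search keeps in sync, while everything else stays database-only:

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed // this entity gets a Lucene index maintained by Hibernate Search
public class Project {

    @Id
    private Long id;

    @Field // mirrored into the Lucene index and searchable as full text
    private String name;

    // Database-only: never copied into the index.
    private String internalNotes;
}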
This is definitely possible. But always be aware of the fact that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more of a decrease in performance you'll see.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way:
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
