SOLR IDF Max docs configuration - search

I'm using SOLR to store the documents that my application searches. The SOLR instance is shared by multiple applications, and the data is grouped by an application id that is unique to each application.
To calculate the TF-IDF score, SOLR uses the total number of documents it holds. How do I change the configuration so that IDF is based only on the documents belonging to a given application id, rather than on all documents across applications?

Even if you store all docs in one collection, there is still something you can do!
Unless you enable ExactStatsCache in your solrconfig.xml like this:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
similarity calculations are done per shard, not across the whole collection.
So, if you shard your docs by application_id, you will get 'better' scores, closer to what you want. They will be exactly what you want if each shard holds a single application_id, but if you have many applications and few shards, you will end up with more than one app per shard.

If you store them in one collection, I am afraid it's not possible with built-in functionality.
I think you have a couple of choices. The first is to store each application's data in a separate collection; then you will have IDF based only on that application's data out of the box.
If this is not suitable for you, you will need to write your own Similarity, probably by extending https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html and overriding the method public abstract float idf(long docFreq, long docCount), which is responsible for calculating IDF.
Overall, I think the first approach will suit your needs much better.
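To illustrate why the document count matters, here is a minimal Python sketch (not Solr or Lucene code). It assumes the classic Lucene-style formula idf = 1 + ln((docCount + 1) / (docFreq + 1)); the function names, the sample corpus, and the app ids are all hypothetical.

```python
import math
from typing import List, Optional, Tuple


def idf(doc_freq: int, doc_count: int) -> float:
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical shared index: (application id, document text).
CORPUS: List[Tuple[str, str]] = [
    ("app1", "invoice payment overdue"),
    ("app1", "invoice sent"),
    ("app2", "payment gateway error"),
    ("app2", "gateway timeout"),
    ("app2", "gateway restart"),
    ("app2", "gateway config"),
]


def idf_for_term(term: str, app_id: Optional[str] = None) -> float:
    """Compute IDF over all documents, or only over one application's documents."""
    docs = [text for owner, text in CORPUS if app_id is None or owner == app_id]
    doc_freq = sum(1 for text in docs if term in text.split())
    return idf(doc_freq, len(docs))


# "invoice" looks rare across the shared index but is common inside app1.
global_idf = idf_for_term("invoice")
app1_idf = idf_for_term("invoice", "app1")
print(f"global: {global_idf:.3f}  app1 only: {app1_idf:.3f}")
```

With separate collections (or one application per shard), the statistics fed into idf are already restricted to one application, which is what the second call imitates.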

Related

GAE datastore data model recommendation for nested "same kind" relations

I have followed the Bookshelf App tutorial (in node.js) by Google, and instead of a book catalogue I would like to model a production part catalogue,
where a part consists of "sub"-parts and tasks.
Every "sub"-part can in turn have "sub"-parts and tasks (manufacturing steps).
Current implementation: At the moment I have only two kinds, Parts and Tasks.
The relation between parts is managed via a property on the child part that stores the unique key (parentId) of its parent part. A bigger headache I have at the moment is, for example, that a price change of a deeply nested sub-part would require recursively updating all parent parts...
Question: What would be the recommended datastore design for such an application?
It should solve, or be more efficient at, the following:
If I change the price of a "sub-sub-sub"-part, the price of all parent parts needs to change according to the chosen calculation methodology.
It should not be limited in depth of sub-parts (I read that the limit on datastore "nested entity values" is 20, but probably did not understand it correctly).
It should not be limited to 1 write per second per "entity group" (a part and all its sub-parts). I've read about this limit, but I am not sure whether it also applies to so-called Transactions (which I think you can do on entity groups).
One potential solution is to avoid storing aggregate prices in Datastore entirely. Instead, the "price" on each part or task should include only the cost of that thing itself, not its sub-parts.
Calculate the total price on the fly when needed, adding up the entire tree of parts/sub-parts/tasks. Store the result in memcache if you want to speed up the calculation (but make sure to delete the memcache key when updating prices).
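A minimal Python sketch of this idea follows. It uses plain dictionaries as stand-ins for Datastore and memcache, so the class name, method names, and sample parts are all hypothetical and nothing here is the actual Datastore or memcache API. Prices are kept in integer cents, and a price change drops the cached totals of the changed item and every ancestor.

```python
from typing import Dict, List, Optional


class Catalogue:
    """In-memory stand-in: parts/tasks (Datastore) plus cached totals (memcache)."""

    def __init__(self) -> None:
        self._own_price: Dict[str, int] = {}         # price of the item itself, in cents
        self._parent: Dict[str, Optional[str]] = {}  # child id -> parent id
        self._children: Dict[str, List[str]] = {}    # parent id -> child ids
        self._cache: Dict[str, int] = {}             # cached aggregate totals

    def add(self, item_id: str, own_price: int, parent_id: Optional[str] = None) -> None:
        """Add a part or task; parents are expected to be added before children."""
        self._own_price[item_id] = own_price
        self._parent[item_id] = parent_id
        self._children.setdefault(item_id, [])
        if parent_id is not None:
            self._children.setdefault(parent_id, []).append(item_id)
        self._invalidate(item_id)

    def set_price(self, item_id: str, own_price: int) -> None:
        """Only the item's own price is written; no parent entity is touched."""
        self._own_price[item_id] = own_price
        self._invalidate(item_id)

    def total_price(self, item_id: str) -> int:
        """Sum the item's own price and the totals of its whole subtree."""
        if item_id not in self._cache:
            self._cache[item_id] = self._own_price[item_id] + sum(
                self.total_price(child) for child in self._children[item_id]
            )
        return self._cache[item_id]

    def _invalidate(self, item_id: str) -> None:
        """Delete cached totals for the item and all of its ancestors."""
        current: Optional[str] = item_id
        while current is not None:
            self._cache.pop(current, None)
            current = self._parent.get(current)


catalogue = Catalogue()
catalogue.add("bike", 1000)
catalogue.add("wheel", 500, parent_id="bike")
catalogue.add("spoke", 20, parent_id="wheel")
catalogue.add("assemble-wheel", 100, parent_id="wheel")

before = catalogue.total_price("bike")
catalogue.set_price("spoke", 50)
after = catalogue.total_price("bike")
print(before, after)
```

The write path stays a single-entity update regardless of nesting depth; the cost moves to the read path, where the cache absorbs repeated calculations.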

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should store all three of those things in a single collection, with a field that describes the data type, and each collection should be organized by date or geographic area so that one part of the world has a smaller portion to search.
I was also told to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. To leverage CosmosDb to its full potential, you need to think of a collection as an entire database system, not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you select a generic field name such as key or partitionKey, you can easily separate the storage of your inbound emails from users, and from anything else, by picking appropriate values.
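class InboundEmail
{
    // Every inbound email lands in the same logical partition.
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}
class User
{
    // Every user lands in a separate logical partition from the emails.
    public string Key { get; set; } = "UsersPartition";
    // other properties
}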
What I'm showing is still only an example, though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions, you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
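The difference between a single-partition query and a cross-partition scan can be sketched in Python with an in-memory stand-in. This is not the CosmosDb SDK; the Collection class, the type and partitionKey field names, and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Document = Dict[str, Any]


class Collection:
    """Toy heterogeneous collection: documents are grouped by partition key."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Document]] = defaultdict(list)

    def insert(self, doc: Document) -> None:
        # Any document type is accepted; only the partition key decides placement.
        self._partitions[doc["partitionKey"]].append(doc)

    def query_partition(self, partition_key: str, doc_type: Optional[str] = None) -> List[Document]:
        """Touches exactly one partition, optionally filtering on the type field."""
        docs = self._partitions.get(partition_key, [])
        return [d for d in docs if doc_type is None or d["type"] == doc_type]

    def query_all(self, doc_type: str) -> Tuple[List[Document], int]:
        """Cross-partition scan; also reports how many partitions were visited."""
        hits = [
            d for docs in self._partitions.values() for d in docs if d["type"] == doc_type
        ]
        return hits, len(self._partitions)


collection = Collection()
collection.insert({"type": "user", "partitionKey": "user-1", "name": "Ann"})
collection.insert({"type": "email", "partitionKey": "user-1", "subject": "Welcome"})
collection.insert({"type": "user", "partitionKey": "user-2", "name": "Bob"})
collection.insert({"type": "email", "partitionKey": "user-2", "subject": "Invoice"})

ann_activity = collection.query_partition("user-1")    # one partition
all_emails, scanned = collection.query_all("email")    # every partition
print(len(ann_activity), len(all_emails), scanned)
```

Here the partition key is the user id, so everything about one user is answered from a single partition, while a query by type alone has to visit all of them.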
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph API or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. It uses a proprietary path-based indexing mechanism that ensures every path of your JSON tree is indexed. You have to specifically opt out of this auto-indexing feature.

Look ahead search on document fields in azure DocumentDb

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDF's but this appears to require an entire collection scan which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document is inserted, updated, or deleted.
DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the path/property you're using for these queries with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.
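Why a range index makes prefix queries cheap can be sketched in Python: a prefix match on sorted keys is just a contiguous range, found with two binary searches. This is an in-memory illustration, not the DocumentDB API; the PrefixIndex class and the hotel names are hypothetical, and it stores a lower-cased key alongside the original value as suggested above.

```python
import bisect
from typing import List


class PrefixIndex:
    """Sorted list of lower-cased keys, searched like a range index."""

    def __init__(self, names: List[str]) -> None:
        # Keep (lower-cased key, original value) pairs in key order.
        self._entries = sorted((name.lower(), name) for name in names)
        self._keys = [key for key, _ in self._entries]

    def starts_with(self, prefix: str, limit: int = 10) -> List[str]:
        """Return up to `limit` original names whose key starts with `prefix`."""
        lowered = prefix.lower()
        start = bisect.bisect_left(self._keys, lowered)
        # Upper bound: the prefix followed by the largest code point
        # (a simplification that ignores keys containing that code point).
        end = bisect.bisect_left(self._keys, lowered + "\U0010ffff")
        return [original for _, original in self._entries[start:min(end, start + limit)]]


index = PrefixIndex(["Hilton", "Hyatt", "hilltop Inn", "Holiday Inn", "Marriott"])

# Re-query as the user types more characters.
for typed in ("H", "Hi", "Hil", "Hilton"):
    print(typed, "->", index.starts_with(typed))
```

Each keystroke narrows the same sorted range, so no full scan is needed, which is the property the range index provides for STARTSWITH.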
I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups; when testing this under load, to avoid saturation and throttling, we found we would have to increase the DocumentDB Request Unit (RU) throughput amount to a point that was not financially viable for us, in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval, a role it performs exceptionally well, while a small ElasticSearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming, and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subject of type ahead queries, creation of indexes, autocomplete analyzer and query time 'search as you type' in ElasticSearch can be found here, here and here
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure market place to create relatively small instances, and most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).

Scalability of Solr and ElasticSearch: fields of 5000 values

I need to send records to a search engine (Solr or ElasticSearch) to index.
In my design, a field can have up to 5000 values and for some records, ALL these 5000 values (OR or AND relationship) of this field need to be sent to the search engine.
I have about 10 fields of this nature, plus 30 other fields (text, integer, etc.).
I wonder whether Solr or ElasticSearch can effectively handle a large number of values of a field and which one does a better job.
What about millions of records in this situation?
What about real-time indexing when there are already millions of records and the number keeps growing? I understand Solr NRT and ElasticSearch can do real-time indexing, but I am not sure whether my situation poses new challenges.
Thanks for any input!
Cheers!
Both Solr and ElasticSearch are based on Lucene, which does the real indexing/querying/storing documents. So performance, in terms of size of fields and documents, should be pretty similar in both.
The choice between one or the other should probably be based on which one you find most enjoyable to work with. ElasticSearch, for example, has a JSON API for querying and indexing, while Solr uses mostly XML for configuration and querying.
If you're going to have millions of documents and/or need to divide the insert/query load across a cluster of machines, ElasticSearch has, in my opinion, an advantage because of how easy it is to shard and create replicas.
Regarding real-time search, both will probably suit your needs. They allow you to customize how frequently the index is "refreshed", which is what makes newly indexed documents appear in search results. For example, in ElasticSearch you can set a refresh to occur once a minute.
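As a rough illustration of the once-a-minute refresh, the Python sketch below only builds the JSON body for ElasticSearch's index settings endpoint using the index.refresh_interval setting; it does not send anything. The index name my-index and the helper function are hypothetical.

```python
import json


def refresh_settings_body(interval_seconds: int) -> str:
    """Build the JSON body that sets how often the index is refreshed."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return json.dumps({"index": {"refresh_interval": f"{interval_seconds}s"}})


body = refresh_settings_body(60)

# The body would be sent to the settings endpoint of a (hypothetical) index.
print("PUT /my-index/_settings")
print(body)
```

A longer interval generally trades freshness of search results for indexing throughput, so the value is worth tuning against the actual ingest rate.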

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure-based website and I want to provide search capabilities using Lucene. (Structured JSON objects would be indexed and stored in Lucene; other content, such as Word documents, would be indexed in Lucene but stored in blob storage.) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
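For the single-index approach, the key point is that the owner restriction is applied by the server on every search and never taken from the user's query text. The Python sketch below illustrates that with an in-memory stand-in; it is not the Lucene API, and the SharedIndex class, the owner_id field, and the sample notes are hypothetical.

```python
from typing import Dict, List


class SharedIndex:
    """One index for all users; every search is forced through an owner filter."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, str]] = []

    def add(self, owner_id: str, text: str) -> None:
        self._docs.append({"owner_id": owner_id, "text": text})

    def search(self, owner_id: str, term: str) -> List[str]:
        """owner_id comes from the authenticated session, not from user input."""
        needle = term.lower()
        return [
            doc["text"]
            for doc in self._docs
            if doc["owner_id"] == owner_id and needle in doc["text"].lower()
        ]

    def all_for_owner(self, owner_id: str) -> List[str]:
        """Predefined query such as 'all notes for user X'."""
        return [doc["text"] for doc in self._docs if doc["owner_id"] == owner_id]


index = SharedIndex()
index.add("alice", "Quarterly budget notes")
index.add("bob", "Budget for the offsite")
index.add("alice", "Travel checklist")

alice_hits = index.search("alice", "budget")
alice_all = index.all_for_owner("alice")
print(alice_hits, alice_all)
```

In a real Lucene setup the same effect would come from combining the user's query with a mandatory clause on the owner field, built in code rather than by concatenating query strings.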
Until you use several machines for indexing, I guess it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices that could be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches use different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I guess name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This will give the system more overall throughput.
