What full-text search technology is out there to support full-text personalized search?
For instance, contact search in your webmail provider of choice: it's full text but only searches your personal contacts and not the entire universe of contacts.
There are countless full-text search packages out there but I don't know how you could use most full-text search packages such that every user only sees a small subset of the universe of documents.
In the case of email, it's simple: use any popular search toolkit and build an index per user. It's simple because the indexes shouldn't overlap, or you'd be violating users' privacy. Also, overlap might skew figures like IDF. (You might be tempted to index emails sent to multiple users only once, but the security and privacy implications of that aren't worth it. Disk is cheap.)
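As a rough illustration of the index-per-user idea, here is a minimal sketch in plain Python rather than any real search toolkit. The class name `PerUserIndex` and its methods are hypothetical; the point is only that each user's postings live in a separate structure, so a query can never reach another user's documents.

```python
from collections import defaultdict


class PerUserIndex:
    """Toy inverted index that keeps a fully separate index for each user."""

    def __init__(self):
        # user_id -> term -> set of document ids
        self._indexes = defaultdict(lambda: defaultdict(set))

    def add(self, user_id, doc_id, text):
        # Index every whitespace-separated term under this user only.
        for term in text.lower().split():
            self._indexes[user_id][term].add(doc_id)

    def search(self, user_id, query):
        # Look only at the caller's own index; unknown users get no hits.
        index = self._indexes.get(user_id)
        terms = query.lower().split()
        if index is None or not terms:
            return set()
        # AND semantics: a document must contain every query term.
        postings = [index.get(term, set()) for term in terms]
        return set.intersection(*postings)


idx = PerUserIndex()
idx.add("alice", 1, "Lunch with John Smith")
idx.add("bob", 2, "John Smith invoice")
print(idx.search("alice", "john smith"))
```

With a real toolkit the same shape would be one index directory (or core) per user, chosen by user id before the query runs.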
If a common collection of documents should be indexed for personalized search, you're on your own, I'm afraid.
I would recommend building a single Lucene index of all contacts with extra fields such as contact_list_id and usage_frequency. At search time, add each user's specific parameters to the query, e.g. text:"John smith" AND contact_list_id:"$current_user_id", sorted by usage_frequency. This gives you one optimized index with all the data compressed in one place, and it is still personalized through fields like usage_frequency or a more robust rank. Think of the index as a database with highly effective text search.
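To make the shared-index approach concrete, here is a small sketch in plain Python (not Lucene itself). The `Contact` class and `search_contacts` function are hypothetical stand-ins: the filter on the owner id plays the role of the contact_list_id clause, and the sort plays the role of ordering by usage frequency.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    contact_list_id: int   # owner of the contact list this entry belongs to
    usage_frequency: int   # how often the owner has used this contact


def search_contacts(contacts, text, current_user_id):
    """Match all query terms, restrict to one user's list, rank by usage."""
    terms = text.lower().split()
    hits = [
        c for c in contacts
        if c.contact_list_id == current_user_id
        and all(t in c.name.lower().split() for t in terms)
    ]
    # Most frequently used contacts come first.
    return sorted(hits, key=lambda c: c.usage_frequency, reverse=True)


contacts = [
    Contact("John Smith", contact_list_id=1, usage_frequency=3),
    Contact("John Doe", contact_list_id=1, usage_frequency=9),
    Contact("John Smith", contact_list_id=2, usage_frequency=50),
]
for c in search_contacts(contacts, "john", current_user_id=1):
    print(c.name, c.usage_frequency)
```

The important design point is that the owner filter is added by the application, never taken from user input, so one user cannot query another user's list.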
We are switching from SQL Fulltext Search to Lucene (SOLR stack) search in the next few months. One last wrinkle in figuring out our strategy here has to do with replicating one current part of our search platform.
First, some nomenclature to describe the problem: Our site has a bunch of documents. People might "add" those documents, they might "favorite" those documents, they might "read" those documents, etc. Let's call that union of such documents for a given user their "personal documents". Some documents are public, and some are private so that only the logged-in-user can see them.
Currently, we have a weighting function that will always show a given user's "personal" documents FIRST in the search list, for any search. This outranks the normal order (but a document must be valid in the result set -- it just ranks above any other less important document). In SQL, we are able to achieve this by having a user-defined-function that returns a score, and it varies by user.
An analogy is Facebook -- where, when you type "Joe", it will first find all the Joes that you know, followed by any other Joe that meets the criteria. My search for "Joe" will return a different ordered set than your search for Joe.
In the world of Lucene/SOLR, as I understand it, I cannot figure out how to have such user-centric weighting of documents without two separate queries that are then effectively UNIONed together (I know, it's not relational, but you get the idea). We have millions of users, and hundreds of thousands of documents. If a user is logged in, we want "their documents" to show up first in any search, then the rest of all documents. And in each case, we want the search results to show only those documents that match the original search -- we're just talking about rank-order.
Can you think of any strategies here to reproduce this user-defined-function feature?
Can you afford to have a field in each document telling this particular document belongs to Jim (e.g. user123Doc:1)? If yes, you could solve it by sorting the result set by {user123Doc, score, ...}.
Or, if you don't want to store this information in Lucene, you can store this elsewhere (e.g. in the database) and implement FieldComparator so it works with these values. More on this is available here.
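The two-level sort suggested in this answer can be sketched in plain Python as follows. This is not Lucene code: `rank_personal_first` is a hypothetical helper, and `personal_doc_ids` stands for whatever source (an indexed field or a database lookup) says which documents belong to the current user.

```python
def rank_personal_first(hits, personal_doc_ids):
    """Order matching hits so the user's own documents come first.

    hits: list of (doc_id, score) pairs that already matched the query.
    personal_doc_ids: set of document ids belonging to the current user.
    """
    return sorted(
        hits,
        # Primary key: is it a personal document? Secondary key: relevance.
        key=lambda hit: (hit[0] in personal_doc_ids, hit[1]),
        reverse=True,
    )


hits = [("a", 0.9), ("b", 0.4), ("c", 0.7)]
print(rank_personal_first(hits, personal_doc_ids={"b"}))
```

Only the order changes: every document in the output still matched the original query, which is the behaviour the question asks for.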
In our app, we have users and users can have friends (think Facebook, relationship is bi-directional). We would like to be able to:
Have a site-wide search for users by name or username
Allow each user to search her friends by name or username
What would be the best approach to design this keeping in mind that:
A user can have up to 50k friends.
Users can change their names and usernames all the time
I am going to suggest another technology which I think will help you with this problem. You can check Neo4j (graph database) which will help you to make relations (user-friend) and traverse graph easily.
You can also use Lucene as a separate index engine alongside Neo4j for full-text search. Check here.
Also, you can find some examples below which could be helpful.
Lucene Integration with Neo4j
Lucene Full Text Indexing with Neo4j
PS : I have no relationship with Neo4j.
Have documents like:
type:friendship
parties_name:[mark zuckerburg, bill gates]
parties_id:[1, 753634] (what if many people are named bill gates)
So there will be one such row for each friendship in your network, and when our particular mark zuckerburg updates his friendships (and name), all rows parties_id:1 must be reindexed.
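A minimal sketch of this friendship-document layout, in plain Python rather than a real index. All function names are hypothetical. It shows both the friend search and the cost of a rename: every friendship document that mentions the user has to be touched.

```python
def build_friendship_docs(users, friendships):
    """users: id -> name; friendships: iterable of (id_a, id_b) pairs."""
    return [
        {
            "type": "friendship",
            "parties_id": [a, b],
            "parties_name": [users[a], users[b]],
        }
        for a, b in friendships
    ]


def search_friends(docs, user_id, name_query):
    """Return ids of user_id's friends whose name contains name_query."""
    q = name_query.lower()
    results = []
    for doc in docs:
        if user_id not in doc["parties_id"]:
            continue
        for pid, pname in zip(doc["parties_id"], doc["parties_name"]):
            # Ids, not names, tell apart people who share a name.
            if pid != user_id and q in pname.lower():
                results.append(pid)
    return results


def rename_user(docs, users, user_id, new_name):
    """Apply a rename and return how many documents had to be reindexed."""
    users[user_id] = new_name
    touched = 0
    for doc in docs:
        if user_id in doc["parties_id"]:
            pos = doc["parties_id"].index(user_id)
            doc["parties_name"][pos] = new_name
            touched += 1
    return touched


users = {1: "mark zuckerburg", 753634: "bill gates", 7: "bill gates"}
docs = build_friendship_docs(users, [(1, 753634), (1, 7), (7, 753634)])
print(search_friends(docs, 1, "bill"))
print(rename_user(docs, users, 1, "mark z"))
```

With up to 50k friends per user, a single rename can mean tens of thousands of document updates, which is the main trade-off of this design.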
I am developing an Azure based website and I want to provide search capabilities using Lucene. (structured json objects would be indexed and stored in Lucene and other content such as Word documents, etc. would be indexed in lucene but stored in blob storage) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so if anyone is kind enough to answer, they will have better idea of what I am trying to do)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
As long as you are not using several machines for indexing, I think a single index is more convenient. The Lucene community has done a lot of work to make the indexing process as efficient as possible. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However there are several reasons why you would want to split indexes:
Your machine has several IO devices that can be used in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
You want to split document fields between indexes (this is what ParallelReader is meant for). This is a more exotic form of splitting, but it may be a good idea if searches use different groups of fields. Suppose we have two query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I guess name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This gives the system more overall throughput.
So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
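The ID-only contract described in this question can be sketched with Python's built-in sqlite3 module. The table names (`staging_search`, `supplier_1`), the extra `weight_kg` column, and the helper functions are all assumptions made for the example; only the shape matters: search returns (class, id) pairs, and a second plain select fetches everything else.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with one supplier stage and the flat table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE supplier_1 (
            id INTEGER PRIMARY KEY, name TEXT, description TEXT, weight_kg REAL);
        CREATE TABLE staging_search (
            supplier_id INTEGER, supplier_prod_id INTEGER,
            name TEXT, description TEXT);
        INSERT INTO supplier_1 VALUES
            (1, 'Red mug', 'Ceramic mug', 0.3),
            (2, 'Blue plate', 'Dinner plate', 0.5);
        INSERT INTO staging_search VALUES
            (1, 1, 'Red mug', 'Ceramic mug'),
            (1, 2, 'Blue plate', 'Dinner plate');
    """)
    return conn


def search_staging(conn, keyword):
    """Return only (supplier_id, supplier_prod_id) pairs for matching rows."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT supplier_id, supplier_prod_id FROM staging_search "
        "WHERE name LIKE ? OR description LIKE ? "
        "ORDER BY supplier_id, supplier_prod_id",
        (pattern, pattern),
    )
    return rows.fetchall()


def fetch_details(conn, supplier_table, ids):
    """Fetch full rows from one supplier's staging table by id."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    # The table name cannot be a bound parameter, so it must come from a
    # trusted mapping of supplier_id -> table name, never from user input.
    rows = conn.execute(
        f"SELECT * FROM {supplier_table} WHERE id IN ({placeholders}) ORDER BY id",
        list(ids),
    )
    return rows.fetchall()


conn = make_demo_db()
pairs = search_staging(conn, "mug")
print(pairs)
print(fetch_details(conn, "supplier_1", [pid for _, pid in pairs]))
```

Because callers only ever see (class, id) pairs, `search_staging` could later be reimplemented on top of Sphinx or Solr without changing `fetch_details` or anything that calls it.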
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index; since an index gives IDs back, it is logical for Solr to do the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it shouldn't matter how you build your objects in your search function: from the DB, using Solr only to retrieve IDs; from the fields Solr returns (provided they're stored); or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the others. Again, if you don't need highlighting, this is not an issue.
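As a small illustration of the fl parameter mentioned in this answer, the sketch below builds a Solr select URL that asks for the id field only. The base URL, the core name `products`, and the helper function are assumptions for the example; `q`, `fl`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def solr_ids_only_url(base_url, query, rows=10):
    """Build a Solr /select URL that returns only the id field of each hit."""
    params = {
        "q": query,    # the search expression
        "fl": "id",    # field list: return identifiers only
        "rows": rows,  # page size
        "wt": "json",  # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr instance with a core named "products".
print(solr_ids_only_url("http://localhost:8983/solr/products", "name:mug"))
```

The returned ids can then be used to load the full objects from the database, keeping Solr as the index and the DB as the source of record.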
I'm using Solr with thousands of documents but only return the ids, for the following reasons:
For Solr:
- if a sync mistake happens, it's not a big deal (especially in your case, where displaying a different price can be a big issue: the item may not be in the right place in the results, but the data shown are right)
- you will save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time)
- you build your results in the same way (you don't need one method to build HTML from Solr and another method to build it from your DB)
I think there is a lot more...
I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire search again for getting project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data, we will end up updating it in a SINGLE PLACE only. Otherwise, if we store this meta data with every indexed pdf document, we will end up updating all of the documents, which is not what I am looking for.
please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar" . If you have no such requirement, then seems like storing the ID in Lucene which refers to a database row is a fine thing to do.
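The two-step lookup discussed here can be sketched in plain Python. Nothing below is Lucene API: `search_documents` stands in for the full-text query over pdf content, and `projects` stands in for a database table keyed by projectId; all names and sample data are hypothetical.

```python
def search_documents(doc_index, keyword):
    """Step 1: full-text match over document content only."""
    kw = keyword.lower()
    return [d for d in doc_index if kw in d["text"].lower()]


def attach_project_metadata(hits, projects):
    """Step 2: resolve each hit's projectId against the metadata store."""
    return [
        {"file": h["file"], "project": projects[h["projectId"]]}
        for h in hits
    ]


# Project metadata lives in exactly one place, keyed by projectId.
projects = {
    10: {"name": "Bridge", "location": "Oslo"},
    20: {"name": "Tunnel", "location": "Bergen"},
}
doc_index = [
    {"file": "a.pdf", "projectId": 10, "text": "Concrete load tests"},
    {"file": "b.pdf", "projectId": 20, "text": "Ventilation design"},
    {"file": "c.pdf", "projectId": 10, "text": "Load bearing cables"},
]

for row in attach_project_metadata(search_documents(doc_index, "load"), projects):
    print(row["file"], row["project"]["name"])
```

Updating a project's details means changing one entry in `projects`; the indexed documents are untouched. The limitation noted above also shows up here: step 1 cannot filter on project fields such as the name.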
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.