Searching for users in a graph database (Neo4j), where there can be multiple (similar, not duplicate) instances of a single user - search

I am working on a metadata search project in which I have a graph database.
The graph database is built by combining multiple datasets about users (purchase history, CRM data, etc.) and holds fields such as user name, age, date of birth, email, and SSN.
So for a single user, there could be multiple instances in the database (one from CRM, one from purchase history, etc.). I want to build a search system that takes some user metadata as a query and returns information combined from all instances of a user.
Is resolving records online (while executing a query, rather than resolving all records beforehand) a good idea?
Is a graph database like Neo4j a good choice for this task?
If yes, what is the optimal way of achieving this in Neo4j?

Related

MongoDB, how to manage user related records

I'm currently trying to learn Node.js and MongoDB by building the server side of a web application that manages insurance documents for an insurance agent.
So let's say I'm the user: I sign in, then I start to add my customers and their insurances.
So I have two related collections, Customers and Insurances.
I have one more collection to store the users' login data; let's call it Users.
I don't want new users to see or modify the customers and insurances of other users.
How can I "divide" every user-related record, so that each user can work only with his own data?
I figured I could add to every record the _id of the user who created it.
For example, I log in as myself and get my ID "001"; I could add a field with this value to every customer and insurance.
That way I could filter every query by this ID.
Would it be a good idea? In my opinion this filtering is a waste of processing power for MongoDB.
If someone has any idea of a solution, or even a link to an article about it, it would be helpful.
Thank you.
This is more a general permissions problem than just a MongoDB question. Also, without knowing more about your schemas it's hard to give specific advice.
However, here are some approaches:
1) Embed sub-documents
Since MongoDB is a document store allowing you to store arbitrary JSON-like objects, you could simply store the customers and insurances wholly inside each user object. That way querying for a user would return their customers and insurances as well.
2) Denormalise
Common practice for NoSQL databases is to denormalise related data (i.e. duplicate the data). This might include embedding a sub-document that is a partial representation of your customers/insurances/whatever inside your user document. This has a similar benefit to the above solution in that it eliminates additional queries for sub-documents. It also has the same drawback of requiring more care to preserve data integrity.
3) Reference with foreign key
This is a more traditionally relational approach, and is basically what you're suggesting in your question. Depending on whether you want the reference to be bi-directional (both documents reference each other) or uni-directional (one document references the other), you can either store the user's ID as a foreign user_id field, or store an array of customer_ids and insurance_ids in the user document. In relational parlance this is sometimes described as "has many" or "belongs to" (the user has many customers, the customer belongs to a user).
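As a rough illustration of the third approach, here is a small Python sketch. It assumes every customer and insurance document carries a user_id field; the helper name scoped_filter and the sample values are made up for the example. The dictionary it returns is an ordinary MongoDB filter document of the kind a driver's find call accepts.

```python
from typing import Any, Dict, Optional


def scoped_filter(user_id: str, extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a MongoDB-style filter restricted to documents owned by user_id.

    The owner condition is applied last, so a caller-supplied "user_id"
    in `extra` cannot widen the query to another user's records.
    """
    query: Dict[str, Any] = dict(extra or {})
    query["user_id"] = user_id  # assumed owner field on every record
    return query


# Example: customers named "Rossi" that belong to user "001".
customer_query = scoped_filter("001", {"last_name": "Rossi"})
print(customer_query)
```

Routing every query through one helper like this makes it harder to forget the owner condition. With an index on user_id, the extra condition would typically narrow the search rather than slow it down.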

Personalized Search Results for Elasticsearch

How would one go about setting up Elasticsearch so that it returns personalized results?
For example, I would want results returned to a particular user to rank higher if they clicked on a result previously, or if they "starred" that result in the past. You could also have a "hide" option that pushes results further down the ranking. From what I've seen with elasticsearch so far, it seems difficult to return different rankings to users based on that user's own dynamic data.
The solution would have to scale to thousands of users doing a dozen or so searches per day. Ideally, I would like the ranking to change in real-time, but it's not critical.
Elasticsearch provides a wide variety of scoring options, but to achieve what you describe you will need to do some additional work.
The function score query and the terms filter with document terms lookup would be the tools of choice.
First, create a document per user listing the links (or link IDs) he has visited and the links he has liked. These documents should be housed in a separate index, and they should be kept up to date from the client side as the user clicks and likes results.
Now, when a user hits the data index, run a function score query whose filter functions point to these fields.
With this approach, since the filter is cached, you should get decent performance too.
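The request body for such a query could look roughly like the Python sketch below. It only builds the JSON body as a dictionary and sends nothing. The index name user_prefs, the fields link_id, title, starred, clicked and hidden, and the weights are all assumptions made for illustration, and the exact terms lookup syntax differs between Elasticsearch versions.

```python
from typing import Any, Dict


def personalized_query(user_id: str, text: str) -> Dict[str, Any]:
    """Build a function_score request body that reweights hits per user."""

    def lookup(path: str) -> Dict[str, Any]:
        # Terms lookup: read the list of link ids from the user's own document.
        return {"terms": {"link_id": {"index": "user_prefs", "id": user_id, "path": path}}}

    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "functions": [
                    {"filter": lookup("starred"), "weight": 3.0},  # boost starred results
                    {"filter": lookup("clicked"), "weight": 1.5},  # mild boost for clicked ones
                    {"filter": lookup("hidden"), "weight": 0.1},   # push hidden results down
                ],
                "score_mode": "multiply",  # combine the weights of all matching functions
                "boost_mode": "multiply",  # then multiply into the text relevance score
            }
        }
    }


body = personalized_query("u42", "graph databases")
```

A weight below 1 is what implements the "hide" option here: the result still matches, but its score shrinks.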

How to auto replicate data in cassandra

I am very new to Cassandra and currently in the early stage of a project in which I am studying it.
Cassandra recommends denormalizing and replicating data, so I have the following scenario:
I have a table, user_master, for users. A user has
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for other users. The search should find all users who match the subject and hobbies provided by the user. For this reason I am planning to create a separate table, user_discovery, which will have only the following attributes for every user:
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do I need to write to both tables on every insert/update in user_master? Can updating user_discovery be automated whenever there is an insert/update in user_master?
Even after studying a bit, I am still not sure that a separate table would improve performance, since the number of users would be the same in both tables (though user_discovery would have far fewer columns). Any comments on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" looks like you plan to have the user (Id?) as key. This would indeed have no performance advantage.
If you want to find users by their hobby make a table having the hobby as key, and the user id (or whatever it is you use to look up users) as columns. Write one row per hobby, listing all users having that hobby. Write the user into every row matching one of his hobbies.
Do the same for the subject (i.e. separate table, subject as key, user ids as columns).
Then, if you want to find a user having a list of specific hobbies, make one query per hobby, creating the intersection of the users.
To use this kind of lookup table you would indeed have to update all the tables every time you update a user.
Disclaimer: I used this kind of approach rather successfully in a relatively complex setting managing a few hundred thousand users. However, this was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether it would be possible to use a more elegant approach today.
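To make the lookup-table idea concrete, here is a small in-memory Python sketch. The dictionaries only stand in for the two Cassandra tables (key to user ids), and the names users_by_hobby, users_by_subject, save_user and find_users are invented for the example. It does not talk to Cassandra, and it does not handle removing a hobby from a user.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Each dict stands in for one lookup table: partition key -> user ids.
users_by_hobby: Dict[str, Set[int]] = defaultdict(set)
users_by_subject: Dict[str, Set[int]] = defaultdict(set)


def save_user(uid: int, subject: str, hobbies: Iterable[str]) -> None:
    """Write the user into every lookup row that should find them."""
    users_by_subject[subject].add(uid)
    for hobby in hobbies:
        users_by_hobby[hobby].add(uid)


def find_users(subject: str, hobbies: Iterable[str]) -> Set[int]:
    """One read per key, then intersect the resulting id sets."""
    matches = set(users_by_subject.get(subject, set()))
    for hobby in hobbies:
        matches &= users_by_hobby.get(hobby, set())
    return matches


save_user(1, "physics", ["chess", "running"])
save_user(2, "physics", ["chess"])
save_user(3, "history", ["chess", "running"])
print(find_users("physics", ["chess", "running"]))  # {1}
```

Note that save_user touches several rows for a single user, which is the "update all the tables on every user update" cost mentioned above.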

How to Design Searching Users and Friends using ElasticSearch?

In our app, we have users and users can have friends (think Facebook, relationship is bi-directional). We would like to be able to:
Have a site-wide search for users by name or username
Allow each user to search her friends by name or username
What would be the best approach to design this keeping in mind that:
A user can have up to 50k friends.
Users can change their names and usernames all the time
I am going to suggest another technology which I think will help you with this problem. You can check out Neo4j (a graph database), which lets you model the user-friend relationships and traverse the graph easily.
You can also use Lucene as a separate index engine with Neo4j to do full-text search. Check here.
Also, you can find some examples below which could be helpful.
Lucene Integration with Neo4j
Lucene Full Text Indexing with Neo4j
PS : I have no relationship with Neo4j.
Have documents like:
type:friendship
parties_name:[mark zuckerburg, bill gates]
parties_id:[1, 753634] (what if many people are named bill gates)
So there will be one such row for each friendship in your network, and when our particular mark zuckerburg updates his friendships (or his name), all rows with parties_id:1 must be reindexed.
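A quick Python sketch of what that reindexing means, using a plain list in place of the search index. The third user and the helper names friendship_doc and rename_user are invented for the example.

```python
from typing import Any, Dict, List

# User id -> current display name (the third user is made up for the example).
names: Dict[int, str] = {1: "mark zuckerburg", 753634: "bill gates", 42: "ada lovelace"}


def friendship_doc(a: int, b: int) -> Dict[str, Any]:
    """Build one document per friendship, copying both names into it."""
    return {"type": "friendship", "parties_id": [a, b], "parties_name": [names[a], names[b]]}


# A plain list stands in for the search index.
index: List[Dict[str, Any]] = [
    friendship_doc(1, 753634),
    friendship_doc(1, 42),
    friendship_doc(42, 753634),
]


def rename_user(uid: int, new_name: str) -> int:
    """Change a name and rebuild every friendship document containing uid.

    Returns how many documents had to be reindexed.
    """
    names[uid] = new_name
    reindexed = 0
    for i, doc in enumerate(index):
        if uid in doc["parties_id"]:
            a, b = doc["parties_id"]
            index[i] = friendship_doc(a, b)
            reindexed += 1
    return reindexed


print(rename_user(1, "mark z"))  # 2
```

The count returned grows with the number of friends, so with up to 50k friends a single rename could mean rewriting up to 50k documents.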

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. The search queries will return only the IDs of the matching items and a class (supplier_id) that identifies which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer that it is important to keep only searchable fields and return only IDs? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
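Here is a rough sketch of that two-step flow in Python, using an in-memory SQLite database in place of both the flattened search table and a staging table. The table names stage_supplier_7 and search_flat, the cost column and the sample rows are all made up for the example.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stage_supplier_7 (id INTEGER PRIMARY KEY, name TEXT, description TEXT, cost REAL);
    CREATE TABLE search_flat (supplier_id INTEGER, supplier_prod_id INTEGER, name TEXT, description TEXT);
    INSERT INTO stage_supplier_7 VALUES
        (1, 'Red mug', 'Ceramic mug', 2.5),
        (2, 'Blue mug', 'Glass mug', 3.0),
        (3, 'Lamp', 'Desk lamp', 9.0);
    INSERT INTO search_flat SELECT 7, id, name, description FROM stage_supplier_7;
    """
)


def search(term: str) -> List[Tuple[int, int]]:
    """Return only (supplier_id, supplier_prod_id) pairs for matching products."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT supplier_id, supplier_prod_id FROM search_flat "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY supplier_prod_id",
        (like, like),
    )
    return rows.fetchall()


def fetch_details(supplier_id: int, ids: List[int]) -> List[tuple]:
    """Load the full rows from the supplier's staging table by primary key."""
    if not ids:
        return []
    table = f"stage_supplier_{int(supplier_id)}"  # int() keeps the table name safe
    marks = ",".join("?" for _ in ids)
    sql = f"SELECT * FROM {table} WHERE id IN ({marks}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()


hits = search("mug")
print(hits)
print(fetch_details(7, [pid for _, pid in hits]))
```

Only search() would need rewriting if the flattened table later moved to Sphinx or Solr; fetch_details() and everything that consumes its rows would stay the same.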
You can regard Solr as a powerful index; since an index gives back IDs, it would be logical for Solr to do the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it shouldn't matter how you build your objects in your search function: from the DB, using Solr solely to retrieve IDs; from fields returned by Solr (provided they're stored); or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
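For what it's worth, building an identifier-only request is just a matter of adding fl=id to the query string. The Python sketch below only constructs the URL and sends nothing; the host, the core name products, the query name:mug and the helper name solr_select_url are assumptions for the example.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, query: str, ids_only: bool = True) -> str:
    """Build a Solr select URL, optionally restricting returned fields to id."""
    params = {"q": query, "wt": "json"}
    if ids_only:
        params["fl"] = "id"  # field list: return only the identifier
    return f"{base_url}/select?{urlencode(params)}"


url = solr_select_url("http://localhost:8983/solr/products", "name:mug")
print(url)
```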
I'm using Solr with thousands of documents but only return the IDs, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (especially in your case, where displaying a wrong price can be a big issue: the item may not be ranked in the right place, but the displayed data are right)
- you will save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr and another from your DB)
I think there are a lot more...

Resources