Phonetic Algorithm to search Usernames

I use DynamoDB to store user profiles. The primary key is an id, and it has to stay that way.
A user profile contains information like the username, a set of friends, ...
So here is the first problem: user A wants to find user B by name. I don't want to do a full DynamoDB scan each time this happens.
Since I already have a Redis server, I thought I could just store username-id pairs there.
So now the real problem: what do I search for?
For example, my username could be Eric1996. A friend of mine doesn't remember the last digits, so he just searches for Eric19.
Or maybe he forgets the capital letter at the beginning and searches for eric1996. In another case he might misspell the name as erik1996, erick1996, or erich1996.
I searched for this topic a bit and learned that there are so-called phonetic algorithms, which match words by how they sound. That would fix the misspelling examples above.
But would such algorithms work for other usernames as well? You know, some users come up with really 3x0tic names or just use random letters. I know a guy who calls himself something like dadddddx__7 online.
I assume this is much harder than a spelling corrector, since a user might have a name that is misspelled on purpose.

DynamoDB and Redis are the wrong tools for this requirement.
I would recommend keeping DynamoDB or Redis as your datastore and adding a search engine such as Solr or ElasticSearch (or a managed AWS search service such as Amazon CloudSearch).
You can store your user profiles in DynamoDB and store the searchable fields in the search store (you can even store full profiles there).
Search features such as tolerating spelling errors or ranking friends by some score are then easy to implement.
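To get a feel for what such a search store does with the examples from the question, here is a minimal Python sketch using only the standard library. It is not how Solr or ElasticSearch work internally; the function name search_usernames, the cutoff of 0.8, and the sample usernames are assumptions chosen for illustration.

```python
import difflib


def search_usernames(query, usernames, limit=5, cutoff=0.8):
    """Return usernames matching the query, prefix matches first, then fuzzy matches.

    Matching is case-insensitive. The cutoff is a similarity ratio between
    0 and 1 (an assumed value, tune it for your data).
    """
    q = query.lower()

    # Map lowercased names back to the stored spelling.
    by_lower = {}
    for name in usernames:
        by_lower.setdefault(name.lower(), name)

    # 1) Prefix matches cover "Eric19" and "eric1996".
    prefix_hits = sorted(low for low in by_lower if low.startswith(q))

    # 2) Fuzzy matches cover misspellings such as "erik1996".
    fuzzy_hits = difflib.get_close_matches(q, list(by_lower), n=limit, cutoff=cutoff)

    # Merge both lists, keeping order and dropping duplicates.
    ordered = []
    for low in prefix_hits + fuzzy_hits:
        if by_lower[low] not in ordered:
            ordered.append(by_lower[low])
    return ordered[:limit]


if __name__ == "__main__":
    users = ["Eric1996", "dadddddx__7", "Bob42"]
    for attempt in ["Eric19", "eric1996", "erik1996", "erick1996"]:
        print(attempt, "->", search_usernames(attempt, users))
```

This scans every username, so it only illustrates the matching logic. A real search engine builds an index so that neither step needs a full scan.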

Related

Neo4j, how to index and give ids to node

I'm kinda new to neo4j, and I want to start building an application with neo4j and nodejs.
From what I understand, Neo4j adds an id to each node it creates, and this id should not be used outside the db. So if I have users, looking up a user by id (the same id Neo4j assigned when that user was created) is not that smart.
So the questions are:
1) What is the best way to look up a user? Email? Could be, but not everything in the application has a unique identifier like a user's email.
2) I saw a few posts about uuid; let's assume I'm using it. Can I save that field under the name id, or do I need some other name?
3) Do I need to do anything special if I want to use that field as an index? (I want the search by id to be fast.)
4) uuid generates a very long string. Isn't it a bit of overhead to index that string? Indexing a number is faster, no?
5) If not uuid, what do you think is the other option?
1) Best way - UUID.
2) Yes.
3) No. You just need to add an index to the database. Example:
CREATE INDEX ON :User(uuid)
4) It's true that id lookup is faster, especially in Neo4j (due to its storage implementation). However, an index-backed lookup by UUID performs very well, and most Neo4j users do this (if there is no other unique identifier in their domain).
5) UUID is the best option, especially when you take into account how to generate ids in a clustered setup. UUIDs make it possible to generate unique identifiers without taking any global database locks. Here you can read a bit more theoretical information about UUIDs.
There are existing Neo4j extensions that can generate UUIDs for you.
For example: GraphAware/neo4j-uuid.
In this extension you can configure the property name and which nodes/relationships UUIDs should be applied to.
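If you prefer to generate the UUID in application code instead of using an extension, a minimal Python sketch could look like this. The helper name new_user_properties and the email property are assumptions for illustration; only the uuid property name comes from the index example above.

```python
import uuid


def new_user_properties(email):
    """Build the property map for a new :User node (hypothetical helper).

    uuid4() is random-based, so no database lock or central counter is needed.
    """
    return {"uuid": str(uuid.uuid4()), "email": email}


if __name__ == "__main__":
    props = new_user_properties("someone@example.com")
    print(props["uuid"])       # 36-character string form, including hyphens
    print(len(props["uuid"]))  # length of the string that gets indexed
```

Regarding the overhead question: the string form is 36 characters, while the underlying value is 128 bits (16 bytes).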

Recommended method to implement prefix search using Firebase

I want to implement prefix search among usernames with Firebase. By prefix search, I mean that after a query foo is performed, usernames with the prefix foo are returned. The result might contain all usernames with the prefix foo or just some special subset stored in the data, but the important thing is that it has to contain the username exactly matching the query if such a username exists.
Since the Firebase data structure is just a custom designable tree, my first idea was to implement a trie data structure over the set of all usernames.
In more detail, each path from the root to any node of the trie corresponds to a unique prefix of some set of existing usernames. Then in each node, in addition to the username matching this node if such a username exists, we can explicitly store some subset of the usernames whose prefix corresponds to this node. This allows us to return a subset of all usernames with a given prefix very quickly.
Such a tree can be easily updated when a username is added or removed. My idea is that the subsets of usernames stored in the nodes can be updated periodically based on some external data, which makes it possible to return more valuable results according to some measure.
I wonder if there is some other recommended method to achieve the goal, especially with Firebase? Any opinion about the above approach is also appreciated.
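For reference, the trie described above can be sketched in plain Python as follows. This is an in-memory illustration only, not Firebase code; the class names, the max_suggestions limit, and the "first come, first cached" policy for the stored subsets are assumptions. Removal and the periodic re-ranking of the subsets are left out.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # next character -> TrieNode
        self.is_username = False  # True if the path to this node is a full username
        self.suggestions = []     # cached subset of usernames below this prefix


class UsernameTrie:
    def __init__(self, max_suggestions=3):
        self.root = TrieNode()
        self.max_suggestions = max_suggestions

    def add(self, username):
        node = self.root
        for ch in username:
            node = node.children.setdefault(ch, TrieNode())
            # Cache the username on every prefix node while there is room.
            if (username not in node.suggestions
                    and len(node.suggestions) < self.max_suggestions):
                node.suggestions.append(username)
        node.is_username = True

    def search(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # The exact match always comes first, even if it is not in the cache.
        results = [prefix] if node.is_username else []
        results += [name for name in node.suggestions if name != prefix]
        return results[: self.max_suggestions]


if __name__ == "__main__":
    trie = UsernameTrie()
    for name in ["foobar", "food", "fool", "foo"]:
        trie.add(name)
    print(trie.search("foo"))
```

A query walks one node per character and reads a single cached list, which maps naturally onto reading one path in a tree-shaped store.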
So far Firebase doesn't have any search API like "WHERE foo LIKE '%bar%'", but you can use ElasticSearch. ElasticSearch, which is based on Lucene, is an extremely powerful document storage and indexing tool. At its core, however, is a very simple search feature that is nearly plug-and-play compatible with Firebase.
Check Firebase's official blog post on integrating ElasticSearch with Firebase:
https://www.firebase.com/blog/2014-01-02-queries-part-two.html

How to Design Searching Users and Friends using ElasticSearch?

In our app, we have users and users can have friends (think Facebook, relationship is bi-directional). We would like to be able to:
Have a site-wide search for users by name or username
Allow each user to search her friends by name or username
What would be the best approach to design this keeping in mind that:
A user can have up to 50k friends.
Users can change their names and usernames all the time
I am going to suggest another technology that I think will help you with this problem. You can check out Neo4j (a graph database), which will help you model the user-friend relations and traverse the graph easily.
You can also use Lucene as a separate index engine with Neo4j to get full-text search.
You can find some examples below that could be helpful.
Lucene Integration with Neo4j
Lucene Full Text Indexing with Neo4j
PS : I have no relationship with Neo4j.
Have documents like:
type:friendship
parties_name:[mark zuckerberg, bill gates]
parties_id:[1, 753634] (ids are needed because many people may be named bill gates)
There will be one such document for each friendship in your network, and when our particular mark zuckerberg updates his friendships (or his name), all documents with parties_id:1 must be reindexed.
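To make the cost of that reindexing concrete, here is a small in-memory Python sketch of the friendship documents. It does not call ElasticSearch; the helper names (build_friendship_doc, rename_user, search_friends) and the dict-based user records are assumptions for illustration.

```python
def build_friendship_doc(user_a, user_b):
    """One document per friendship, holding both ids and both names."""
    return {
        "type": "friendship",
        "parties_id": [user_a["id"], user_b["id"]],
        "parties_name": [user_a["name"], user_b["name"]],
    }


def rename_user(docs, user_id, new_name):
    """Update every friendship document of user_id; return how many were touched.

    In a real index each touched document would have to be reindexed.
    """
    touched = 0
    for doc in docs:
        if user_id in doc["parties_id"]:
            position = doc["parties_id"].index(user_id)
            doc["parties_name"][position] = new_name
            touched += 1
    return touched


def search_friends(docs, user_id, query):
    """Return ids of user_id's friends whose name contains the query."""
    q = query.lower()
    hits = []
    for doc in docs:
        if user_id not in doc["parties_id"]:
            continue
        for pid, pname in zip(doc["parties_id"], doc["parties_name"]):
            if pid != user_id and q in pname.lower():
                hits.append(pid)
    return hits
```

With this layout a rename by a user with 50k friends touches 50k documents, which is the trade-off to weigh against users changing their names "all the time".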

Storing information in a forum like web site

Suppose that we have a web site where each person has a profile and other people write comments on that person's profile (like the wall on Facebook). What is the best way to store the comments made for a person? I was thinking of a relational database with a field that holds all the comments for a person as one long string separated by some kind of delimiter, but I am not sure if this is the best way. Any ideas?
You'll have two separate tables, one for Users and one for Comments, with every entry having its own unique ID. The schema would look like:
Users (ID, name, mail, etc)
Comments (ID, for, from, time, content, etc)
Where for and from fields are User IDs.
You can use PostgreSQL, MySQL, SQLite, or even LevelDB if you want a simple key-value store. There are a lot of tutorials out there to get started with any of them.
The problem with relational databases is that they do not scale well to super massive social networking sites. When your tables get huge, queries start to take more and more time. If your site is going to be fairly small, a relational database is fine. I think you may want to investigate "NoSQL" databases.
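As a rough illustration of the two-table schema, here is a sketch using Python's built-in sqlite3 module. The columns for_user and from_user stand in for the "for" and "from" fields (FROM is a reserved word in SQL), and created_at stands in for "time"; those names, the index, and the helper functions are assumptions.

```python
import sqlite3


def create_schema(conn):
    """Create the Users and Comments tables with one row per comment."""
    conn.executescript("""
        CREATE TABLE users (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            mail TEXT
        );
        CREATE TABLE comments (
            id         INTEGER PRIMARY KEY,
            for_user   INTEGER NOT NULL REFERENCES users(id),  -- profile owner
            from_user  INTEGER NOT NULL REFERENCES users(id),  -- comment author
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            content    TEXT NOT NULL
        );
        -- Speeds up "all comments on this profile" lookups.
        CREATE INDEX comments_for_user ON comments(for_user);
    """)


def comments_on_profile(conn, user_id):
    """Return (author name, content) pairs for one profile, oldest first."""
    rows = conn.execute(
        """
        SELECT u.name, c.content
        FROM comments AS c
        JOIN users AS u ON u.id = c.from_user
        WHERE c.for_user = ?
        ORDER BY c.id
        """,
        (user_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob')")
    conn.execute(
        "INSERT INTO comments (for_user, from_user, content) VALUES (1, 2, 'Nice profile!')"
    )
    print(comments_on_profile(conn, 1))
```

Storing one row per comment avoids parsing a delimited string and lets the database filter, sort, and paginate comments.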
Start here:
http://nosql-database.org/

Using different types/format of _id ok?

I'm using one database for all (users, files and comments).
I was wondering if I can/should use
twitter user id for user doc _id's
md5 hash (of file) for file doc _id's
provided uuid for comment doc _id's
It feels weird to mix those different types of ids.
What speaks against this scenario? Should I stick to the CouchDB uuids for consistency?
Use any format of id, or combination of formats, as you see fit. You might wish to add a prefix to ensure there are no overlaps between them, though.
twitter:#rnewson
md5:86f646c11b3bc7d434d06c077aee43d8
And so on.
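A small Python sketch of this prefix scheme is shown below. The "comment:" prefix and the helper names are assumptions for illustration; only the "twitter:" and "md5:" prefixes appear in the examples above.

```python
import hashlib


def user_doc_id(twitter_user_id):
    """Document id for a user, based on the Twitter user id."""
    return f"twitter:{twitter_user_id}"


def file_doc_id(file_bytes):
    """Document id for a file, based on the MD5 hash of its content."""
    return "md5:" + hashlib.md5(file_bytes).hexdigest()


def comment_doc_id(comment_uuid):
    """Document id for a comment, based on a uuid (assumed 'comment:' prefix)."""
    return f"comment:{comment_uuid}"


def doc_type(doc_id):
    """Recover the kind of document from the prefix of its id."""
    return doc_id.split(":", 1)[0]
```

Because the prefix is part of the id, a user id can never collide with a file hash or a comment uuid, and the document type can be read off the id without fetching the document.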

Resources