I want to implement prefix search among usernames with Firebase. By prefix search, I mean that when a query foo is performed, usernames with the prefix foo are returned. The result might contain all usernames with the prefix foo or just some special subset stored in the data, but it must contain the username exactly matching the query if such a username exists.
Since Firebase's data structure is just a freely designable tree, my first idea was to implement a trie over the set of all usernames.
In more detail, each path from the root to any node of the trie corresponds to a unique prefix of some set of existing usernames. Then in each node, in addition to the username matching that node (if such a username exists), we can explicitly store some subset of the usernames with the prefix corresponding to that node. This allows us to return a subset of all usernames with a given prefix very quickly.
Such a tree is easy to update when a username is added or removed. My idea is that the subsets of usernames stored in the nodes can be updated periodically based on some external data, which allows returning more valuable results according to some measure.
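Here is a minimal sketch of the layout I have in mind, as a plain Python dict (the node keys "username" and "suggestions" are my own hypothetical naming, not anything Firebase prescribes):

    # Trie over the usernames "foo" and "foobar". Each letter is a child key;
    # each node stores the exact match (if any) plus a precomputed subset of
    # completions (the "suggestions" at the intermediate b/a nodes elided).
    trie = {
        "f": {"suggestions": ["foo", "foobar"], "o": {
            "suggestions": ["foo", "foobar"], "o": {
                "username": "foo",
                "suggestions": ["foo", "foobar"],
                "b": {"a": {"r": {
                    "username": "foobar",
                    "suggestions": ["foobar"],
                }}},
            },
        }},
    }

    def lookup(trie, prefix):
        # Walk one level per character; a missing child means no match.
        node = trie
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        return node.get("suggestions", [])

    print(lookup(trie, "foo"))  # ['foo', 'foobar']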
Is there some other recommended method to achieve this, especially with Firebase? Any opinion on the above approach is also appreciated.
So far Firebase doesn't have any API for searches like WHERE foo LIKE '%bar%'. However, you can use ElasticSearch, which is based on Lucene and is an extremely powerful document storage and indexing tool. At its core, though, is a very simple search feature, and it is nearly plug-and-play compatible with Firebase.
Check Firebase's official blog post on integrating ElasticSearch with Firebase:
https://www.firebase.com/blog/2014-01-02-queries-part-two.html
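For the prefix-search part specifically, once usernames are mirrored into an index, a single prefix query does the job. A minimal sketch, assuming a local ElasticSearch with a hypothetical usernames index and a username field:

    import requests

    # Prefix query: returns documents whose username starts with "foo".
    resp = requests.get(
        "http://localhost:9200/usernames/_search",
        json={"query": {"prefix": {"username": "foo"}}},
    )
    hits = resp.json()["hits"]["hits"]
    print([h["_source"]["username"] for h in hits])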
I have worked with the Azure Search service previously, where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use case where I want to ingest from multiple data sources, each having a different data schema. Assume these data sources are 3 search APIs belonging to teams X, Y, and Z. All of them take a search term and give back results in their own schema. I want my Azure Search service to act as a proxy for these, so that I have one search API a user can call to get results from multiple sources, ordered correctly.
How should I go about doing it? I assume I might have to create a common schema, and whenever a user searches for something, I would call these 3 APIs to get results, map them to the common schema, and then index that data into an Azure Search index. Finally, I'd call the Azure Search API to give the results back to the caller.
I would appreciate any help! If I can get hold of better documentation for this work, that would be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against each of them, or you can try to combine all of them into the same index. The benefit of the second approach is a better way to implement ordering/paging, as all the information is stored in the same index.
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way to get results ranked across all teams like this is to maintain a single index with a common schema containing data from all of them.
One potential pitfall with this approach is conflicts in the schema, for example if one team requires a field to be of a specific datatype or to use a specific analyzer while another team has different requirements. We do this in our indexes, but with some carefully selected common fields plus dedicated fields prefixed according to our own naming convention to avoid conflicts.
Another thing to consider is the need to reset the index. If you need to add, change, or remove fields, you have to delete the index and create it again with the new schema. With a common index, if team X needs to add a new property, you would need to reset (delete and recreate) the common index, which affects all teams.
So creating separate indexes per team has its benefits: each team can have its own schema without risk of conflicts, and each can reset its index without affecting the other teams.
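A minimal sketch of the ingest-and-map flow described above; the team endpoints and their field names (docId, uid, key, ...) are hypothetical placeholders, and the api-version may differ:

    import requests

    # Hypothetical mapping functions from each team's schema to the common one.
    def from_x(r): return {"id": "X-" + r["docId"], "title": r["name"], "body": r["text"]}
    def from_y(r): return {"id": "Y-" + r["uid"], "title": r["title"], "body": r["summary"]}
    def from_z(r): return {"id": "Z-" + r["key"], "title": r["label"], "body": r["details"]}

    SOURCES = {
        "X": ("https://x.example.com/search", from_x),
        "Y": ("https://y.example.com/search", from_y),
        "Z": ("https://z.example.com/search", from_z),
    }

    def ingest(term, service, index, api_key):
        docs = []
        for team, (url, to_common) in SOURCES.items():
            for raw in requests.get(url, params={"q": term}).json():
                doc = to_common(raw)
                doc["team"] = team                       # provenance, useful for filtering
                doc["@search.action"] = "mergeOrUpload"  # Azure Search upsert action
                docs.append(doc)
        # Push the batch into the common index via the Azure Search REST API.
        return requests.post(
            "https://%s.search.windows.net/indexes/%s/docs/index" % (service, index),
            params={"api-version": "2020-06-30"},
            headers={"api-key": api_key},
            json={"value": docs},
        )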
I use DynamoDB to store user profiles. The primary key is an id; it is necessary that the key is an id.
A user profile contains information like the username, a set of friends, ...
So now here is the first problem: user A wants to search for user B by name. I don't want to do a full DynamoDB scan each time this happens.
Since I already have a Redis server, I thought I could just store username-id pairs there.
So now the real problem: what do I search for?
For example, my username could be Eric1996. A friend of mine doesn't remember the last digits, so he just searches for Eric19.
Or maybe he forgets the capital letter at the beginning and searches for eric1996. In another case he might misspell the name, like erik1996, erick1996, or erich1996.
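For the prefix and capitalization cases, a Redis sorted set would already go a long way. A minimal sketch of the username-id-pairs idea (the key name usernames and the :: separator are my own choices; assumes ASCII usernames):

    import redis

    r = redis.Redis()

    # Store "lowercased-name::id" members with score 0; with equal scores,
    # ZRANGEBYLEX gives lexicographic (i.e. prefix) range queries.
    r.zadd("usernames", {"eric1996::42": 0, "erica77::43": 0})

    def prefix_search(prefix, limit=10):
        p = prefix.lower()  # lowercasing handles the "eric1996" case
        members = r.zrangebylex("usernames", "[" + p, "[" + p + "\xff",
                                start=0, num=limit)
        return [m.decode().rsplit("::", 1) for m in members]

    print(prefix_search("Eric19"))  # [['eric1996', '42']]

Misspellings like erik1996, however, are not covered by this.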
I searched around this topic a bit and learned that there are phonetic algorithms, which match words by how they sound. That would handle the examples above.
But would such algorithms work for other usernames as well? You know, some users come up with really 3x0tic names or just use random letters. I know a guy who calls himself something like dadddddx__7 online.
I assume this is much harder than a spelling corrector, since a user might have a name that is misspelled on purpose.
DynamoDB or Redis alone is the wrong tool for these requirements.
I would recommend using DynamoDB or Redis as your datastore, and Solr or ElasticSearch (or a managed AWS offering such as Amazon CloudSearch) as your search store.
You can store your user profiles in DynamoDB and store the searchable fields in your search store (you can even store full profiles there).
Then search functionality like tolerating spelling errors or ranking friends by some score is easy to implement.
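For instance, with ElasticSearch both cases from the question map to built-in query types: prefix for "Eric19" and fuzzy for "erik1996". A minimal sketch, assuming a hypothetical users index with a username field:

    import requests

    # "should" means either clause may match: prefixes OR small misspellings.
    query = {"query": {"bool": {"should": [
        {"prefix": {"username": "eric19"}},
        {"fuzzy": {"username": {"value": "erik1996", "fuzziness": "AUTO"}}},
    ]}}}

    resp = requests.get("http://localhost:9200/users/_search", json=query)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"]["username"], hit["_score"])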
I'm kinda new to Neo4j, and I want to start building an application with Neo4j and nodejs.
From what I understand, Neo4j adds an internal id to each node it creates, and this id should not be used outside the DB. That means that if I have users, looking a user up by that id (the same id Neo4j assigned when the user was created) is not that smart.
So the questions are:
1) What would be the best way to look up a user? Email? Could be, but not everything in the application has a unique identifier the way a user has an email.
2) I saw a few posts about UUIDs; let's assume I'm using them. Can I save that field under the name id, or do I need some other name?
3) Do I need to do something special if I want to use that field as an index? (I want lookups by id to be fast.)
4) A UUID is a very long string. Isn't it a bit of overhead to index that string? Indexing a number is faster, no?
5) If not a UUID, what do you think the other options are?
1) Best way: UUID.
2) Yes, you can name it id.
3) No, nothing special; you just need to add an index to the database. Example:

    CREATE INDEX ON :User(uuid)
4) It's true that internal id lookups are faster, especially in Neo4j (due to the storage implementation). However, an index-backed lookup using a UUID performs very well, and most Neo4j users do it this way (if there is no other unique identifier in their domain).
5) UUID is the best option, especially when you take into account how to generate IDs in a clustered setup. UUIDs make it possible to generate a unique identifier without taking any global database locks, etc. Here you can read a bit more theoretical information about UUIDs.
There are existing Neo4j extensions which can generate UUIDs for you, for example GraphAware/neo4j-uuid. In this extension you can configure the property name, which node/relationship labels UUIDs should be applied to, and so on.
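A minimal sketch of doing it by hand instead, using the official Python driver; the connection details are placeholders, and the uniqueness constraint (same legacy syntax era as the index above, and it also creates a backing index) stands in for the plain index from point 3:

    import uuid
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    with driver.session() as session:
        # A uniqueness constraint creates the index and rejects duplicates.
        session.run("CREATE CONSTRAINT ON (u:User) ASSERT u.uuid IS UNIQUE")

        # Generate the id client-side -- no global database lock required.
        uid = str(uuid.uuid4())
        session.run("CREATE (u:User {uuid: $uuid, name: $name})",
                    uuid=uid, name="alice")

        # Index-backed lookup by uuid.
        record = session.run("MATCH (u:User {uuid: $uuid}) RETURN u.name",
                             uuid=uid).single()
        print(record[0])  # alice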
I am developing an Azure-based website and I want to provide search capabilities using Lucene (structured JSON objects would be indexed and stored in Lucene; other content, such as Word documents, would be indexed in Lucene but stored in blob storage). I want the search to be secure, such that one user can never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
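The single-index, user-ID-as-parameter variant from the question can be kept safe by always attaching the filter on the server side. A rough sketch against Solr (the documents core and the owner_id field are hypothetical names):

    import requests

    def search_for_user(user_id, user_query):
        # fq is appended server-side, never taken from the client, so a user
        # can only ever match documents tagged with their own owner_id.
        params = {
            "q": user_query,
            "fq": "owner_id:%s" % user_id,
            "wt": "json",
        }
        return requests.get("http://localhost:8983/solr/documents/select",
                            params=params).json()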
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
if your machine has several IO devices that can be utilized in parallel. In this case, if you are IO-bound, splitting indexes is a good idea.
splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed over different groups of fields. Suppose we have two query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (presumably, name updates are far rarer than price updates), then updating only part of the index requires fewer IO resources. This gives the system more overall throughput.
So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and, instead of inserting them directly into our catalog, store all the information in a staging area. Each supplier has their own stage (i.e. a table in the database), and I then flatten the multiple staging areas into a single entity (currently a single table, but later on perhaps Sphinx or Solr). Our merchandisers can then search the staging products' relevant fields (name and description), be shown a list of matching products, and choose to have those products pushed into the live catalog. The search will query the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table, e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items plus a class (supplier_id) that identifies which staging area each product came from.
Another senior engineer feels the flattened search table should also include other meta fields (which would not be searched on) that could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all of this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs, which can then be used to fetch all the other necessary metadata about the product (a simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from the database to a search server like Sphinx or Solr, and the rest of the code wouldn't have to change just because the implementation of the search changed.
Am I on the right path? How can I convince the other engineer that it is important to keep only searchable fields and return only IDs? Or, more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of Sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea, as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index, and since an index gives IDs back, it is logical that Solr does the same.
You can use the Solr query parameter fl to ask for identifier-only results, for instance fl=id.
However, there is one feature that requires Solr to give you back some data too: highlighting of search terms in the matched documents. If you don't need it, then using Solr to retrieve only the identifiers is fine (I assume you need only the document list and no other features, like facets, related docs, or spell checking).
That said, what matters is how you build your objects in your search function: either from the DB, using Solr only to retrieve IDs; or from the fields Solr returns (provided they're stored); or even a mix of both. Think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
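A minimal sketch of the IDs-from-Solr, rows-from-DB flow (the staging core, field names, and SQLite table are illustrative placeholders):

    import sqlite3
    import requests

    # 1) Ask Solr for ids only, via fl=id.
    resp = requests.get(
        "http://localhost:8983/solr/staging/select",
        params={"q": "name:widget", "fl": "id", "wt": "json"},
    ).json()
    ids = [doc["id"] for doc in resp["response"]["docs"]]

    # 2) Fetch the full rows from the database by id.
    conn = sqlite3.connect("catalog.db")
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT * FROM staging_products WHERE id IN (%s)" % placeholders, ids
    ).fetchall()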
I'm using Solr with thousands of documents but only return the ids, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (especially in your case, displaying a wrong price would be a big issue; this way, at worst an item shows up in the wrong place, but the data itself is right)
- you save a lot of time when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results the same way (you don't need one method to build HTML from Solr results and another from your DB)
I think there are a lot more...