Look ahead search on document fields in azure DocumentDb - azure

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDF's but this appears to require an entire collection scan which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted\updated\deleted?

DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the path/property you're using for these queries with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.

I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups; when testing this under load, to avoid saturation and throttling, we found we would have to increase the DocumentDB Request Unit (RU) throughput amount to a point that was not financially viable for us, in our specific circumstances.
We decided that DocumentDB was best used as the persistent store, and 'full' data retrieval - and this role it performs exceptionally well - while a small ElasticSearch cluster performed the role it was designed for - text search, faceted search, weighted search, stemming, and most relevant to your question, autocomplete analyzersand completion suggesters.
The subject of type ahead queries, creation of indexes, autocomplete analyzer and query time 'search as you type' in ElasticSearch can be found here, here and here
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure market place to create relatively small instances, and most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).

Related

How do I find out right data design and right tools/database/query for below requirement

I have a kind of requirement but not able to figure out how can I solve it. I have datasets in below format
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use-case is to perform comparison, aggregation and queries over multiple row like
time difference between last 2 rows where id=123
time difference between last 2 rows where id=123&GradeA
Time difference between first, 3rd, 5th and latest one
all data (or last 10 records for particular id) should be easily accessible.
Also need to further do compute. What format should I chose for dataset
and what database/tools should I use?
I don't Relational Database is useful here. I am not able to solve it with Solr/Elastic if you have any ideas, please give a brief.Or any other tool Spark, hadoop, cassandra any heads?
I am trying out things but any help is appreciated.
Choosing the right technology is highly dependent on things related to your SLA. things like how much can your query have latency? what are your query types? is your data categorized as big data or not? Is data updateable? Do we expect late events? Do we need historical data in the future or we can use techniques like rollup? and things like that. To clarify my answer, probably by using window functions you can solve your problems. For example, you can store your data on any of the tools you mentioned and by using the Presto SQL engine you can query and get your desired result. But not all of them are optimal. Furthermore, usually, these kinds of problems can not be solved with a single tool. A set of tools can cover all requirements.
tl;dr. In the below text we don't find a solution. It introduces a way to think about data modeling and choosing tools.
Let me take try to model the problem to choose a single tool. I assume your data is not updatable, you need a low latency response time, we don't expect any late event and we face a large volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you wanna query on a particular ID), so solutions like parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned based on the ID. Both the first and second requirements and the last requirement, count on ID as an identifier part and it seems there is nothing like join and global ordering based on other fields like time. So we can choose ID as the partitioner (physical or logical) and atime as the cluster part; For each ID, events are ordered based on the time.
The third requirement is a bit vague. You wanna result on all data? or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on the mentioned notes, it seems we should choose a tool that has good support for random access queries. Tools like Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are things that currently I can remember them. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Based on the storing manner (segment files) that Druid follows, by choosing the right data model, you can have analytic queries on a huge volume of data in sub-seconds. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. But by using rollup, we lose raw data and we need them.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great on random access, maybe the greatest. With some kind of filter aggregations, we can have a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine we can answer the first and second questions with ES, but for now, I can't make a query in my mind. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can answer our requirements, but there is a lot of 'if's on the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Azure Search- replicating result of nested SQL query

I have a database comprising of the following schema depicting the linkage between individuals who connect with multiple Advisors and these Advisors have affiliations with multiple organizations
Individuals--> Advisors (m:n relationship)
Advisors --> Enterprises (m:n relationship)
The business need is to enable search on all these concepts and organize results around AdvisorIds. As an example, display of a search result could be as follows
a) Advisor1-> connected to Individuals A,B,C; and linked to Enterprises X,Y
b) Advisor2-> connected to Individuals A, E; and linked to Enterprises M,X,Z
Towards this, we created a flattened table on these concepts and the relationship between them. Hence the same AdvisorId would appear in multiple rows
When I search for a string, I want to ensure that ALL records around an AdvisorId to be returned together irrespective of search score of the individual records.
One approach could be
a) first run an Azure Search and get a result of AdvisorId, ordered by search score of each record. This will repeat Advisor Ids
b) take a distinct set of AdvisorIds (across pages) via standard SQL
c) for each AdvisorId, pick all the related records via standard SQL
2 questions
Here a lot of processing in (b) and (c) will be done outside Azure leading to delays. Also, if I were to use pagination for (a), I am never sure of number of AdvisorId's, I end up with after the distinct operation
I wanted to check if there is a way to get the nested search implemented in Azure to do (a), (b) and (c) as a single API call
If I were to use facets for handling (a) and (b) together, how do I ensure that the ordering is based on the best search-score document within a facet
There isn't a way to achieve what you want in a single request unless you model your data differently. Instead of denormalizing the Individuals-Advisors and Advisors-Enterprises relationships, it may be possible instead to have one document per Advisor and use collections to store information about the related Individuals and Enterprises. This may or may not work for you depending on whether you need to supported correlated filtering on Individuals and Enterprises that are related to an Advisor. There is a whitepaper here that should help you evaluate whether this approach would work for you.
Another option might be to model Individuals, Advisors, and Enterprises as separate indexes, issue three queries, and do a client-side join. However, this is limited by the number of Advisor IDs you'd need to send in the queries on Individuals and Enterprises. Azure Search has limits on the size of filters that can make this impractical unless your queries have low recall.
We are working on making Azure Search better for scenarios like yours. For example, we're currently working on adding support for complex types. Please vote on User Voice and feel free to suggest other features that would help.

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure based website and I want to provide search capabilities using Lucene. (structured json objects would be indexed and stored in Lucene and other content such as Word documents, etc. would be indexed in lucene but stored in blob storage) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so if anyone is kind enough to answer, they will have better idea of what I am trying to do)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Until you will use several machines for indexing I guess it's more convenient to use single index. Lucene community done a lot of work to make indexing process as efficient as it can. So unless you intentionally want to implement distributed indexing I doesn't recommend you to split indexes.
However there are several reasons why you would want to split indexes:
if your machine have several IO devices which could be utilized in parallel. In this case, if you are IO bound, splitting indexes is good idea.
splitting document fields between indexes (this is what ParallelReader is supposed for). This is more exotic form of splitting, but it may be a good idea if search is performed using different groups of fields. Suppose, we have two search query types: the first is using field name and type, and the second is using fields price and discount. If those fields are updated at different rate (I guess, name updates are far more rarely than price updates), updating only part of index would require less IO resources. This will give more overall throughput to the system.

How does solr work with data split into different services and therefore not synchronously available?

take for instance an ecommerce store with catalog and price data in different web services. Now, we know that solr does not allow partial updates to a document field(JIRA bug), so how do you index these two services ?
I had three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - have price and catalog in separate index and join them in solr. You cant join them in your client side code, without screwing up pagination and facet counts. I dont know if this is possible in pre-solr 4.0
have some sort of intermediate indexing service, which composes an entire document based on the results from both these services and sends this for indexing. however there are two problems with this approach:
3.1 You can still compose documents partially, and then when the document is complete, you can set a flag indicating that this is a complete document. However, to do this each time a document has to be indexed, it has to first check whether the document exists in the index, edit it and push it back. So, big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not silently drops it and hopes that when it appears in the other service, the first service will already be populated. This is OK, but it means that an item is not available in search until all fields are available (not desirable always - if u dont have price, you can simply set it to out-of-stock and still have it available)
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now, you have two different entry points (2 different web services) into indexing and each has to check the other
The usual way to solve this is close to your 3.2: write code that creates the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices when indexing. Wether you want to have items in the search from the catalog that doesn't have prices available depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1000 products and then prices for all the products at the same time.
There is no reason why you should drop an item from the index if it doesn't have price, unless you don't want items without prices in your index. It's up to you and your particular need what kind of information you need to have available before indexing the document.
As far as I remember 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it'll make your situation that much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. It gets a lot of its speed from not having record-level locking and update.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident its a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and the application and the channel. The integration piece does the work of making the actual query, and formatting the results, so we had a generic SQL integration piece for accessing a database, and for things like web services, we had some base classes that implemented common functionality, so the actual customization of the integration piecess was a lot less work than it sounds like. The application could then query the channel, which would handle accessing the various data sources, transforming them into a normalized bit of XML, and return the results to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources where there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source update the application when, for example, it was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted indexed. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
Have you taken a look at YQL? It may not be the perfect solution but I might give you starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned cache and parallize optimizations.
However, with all of that it you will need weighing up the development time/deployment time /long term benefits of the effort against migrating the old legacy database to a faster more modern one. You haven't said how tied into other systems those databases are so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching if your data if you don't need the data to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache them. If you employ caching then you could make your system configurable as to what tables/columns to include or exclude from the cache and you could give each table/column a personalizable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.

Resources