Date function and Selecting top N queries in DocumentDB - azure

I have the following questions regarding Azure DocumentDB:
According to this article, multiple functions have been added to DocumentDB. Is there any way to get date functions working? How can I get queries of the form "greater than some date" working?
Is there any way to select the top N results, like 'Select top 10 * from users'?
According to the Document playground, Order By will be supported in the future. Is there any other way around this for now?
The application that I am developing needs to display a certain number of recently inserted results. I need this functionality within a stored procedure. The documents that I am storing in DocumentDB have a DateTime property. My application requires the functionality described above to work. I have searched the documentation and samples. Please help if you know of any workaround.

Some thoughts/suggestions below:
Please take a look at this idea on how to store and query dates in DocumentDB (as epoch timestamps). http://azure.microsoft.com/blog/2014/11/19/working-with-dates-in-azure-documentdb-4/
To get top N results, set FeedOptions.MaxItemCount and read only one page, i.e., call ExecuteNextAsync() once. See https://msdn.microsoft.com/en-US/library/microsoft.azure.documents.linq.documentqueryable.asdocumentquery.aspx for an example. We're planning to add TOP to the grammar to make this easier in the future.
You can email me at arramac at microsoft dot com to get early access to Order By right away. This is planned for broad release shortly.
Please note that stored procedures are best used when you have one or more write operations. You'll get better throughput on reads when you query directly.
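To make the first two suggestions concrete, here is a minimal sketch in plain Python (not the DocumentDB SDK). It assumes each document carries a numeric copy of its date in a hypothetical `createdEpoch` property, so "greater than some date" becomes a number comparison, and it imitates reading a single page of `page_size` items. The helper names `to_epoch`, `make_doc` and `newer_than` are made up for illustration.

```python
from datetime import datetime, timezone

def to_epoch(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    return int(dt.timestamp())

def make_doc(doc_id: str, created: datetime) -> dict:
    """Hypothetical document shape: readable date plus a numeric epoch copy."""
    return {"id": doc_id, "created": created.isoformat(), "createdEpoch": to_epoch(created)}

docs = [
    make_doc("a", datetime(2015, 1, 1, tzinfo=timezone.utc)),
    make_doc("b", datetime(2015, 1, 5, tzinfo=timezone.utc)),
    make_doc("c", datetime(2015, 1, 3, tzinfo=timezone.utc)),
    make_doc("d", datetime(2015, 1, 9, tzinfo=timezone.utc)),
]

def newer_than(documents, cutoff: datetime, page_size: int):
    """Keep documents created after `cutoff`, then return only the first page."""
    threshold = to_epoch(cutoff)
    matches = [d for d in documents if d["createdEpoch"] > threshold]
    return matches[:page_size]

first_page = newer_than(docs, datetime(2015, 1, 2, tzinfo=timezone.utc), 2)
print([d["id"] for d in first_page])
```

Note that the first page comes back in storage order here, not newest first; ordering by date is a separate concern, as discussed above.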

Related

How do I find out right data design and right tools/database/query for below requirement

I have a requirement but cannot figure out how to solve it. I have datasets in the format below:
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, such as:
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest rows
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset, and what database/tools should I use?
I don't think a relational database is useful here. I have not been able to solve it with Solr/Elastic; if you have any ideas, please give a brief outline. Any pointers on other tools such as Spark, Hadoop or Cassandra?
I am trying things out, but any help is appreciated.
Choosing the right technology depends heavily on your SLA: How much query latency can you tolerate? What are your query types? Does your data count as big data? Is the data updatable? Do you expect late events? Will you need the historical raw data in the future, or can you use techniques like rollup? To clarify my answer: you can probably solve your problems with window functions. For example, you could store your data in any of the tools you mentioned and use the Presto SQL engine to query it and get the desired result. But not all of those combinations are optimal. Furthermore, problems like this usually cannot be solved with a single tool; a set of tools is needed to cover all requirements.
tl;dr: the text below does not arrive at a solution. It introduces a way to think about data modeling and choosing tools.
Let me try to model the problem so that we can choose a single tool. I assume your data is not updatable, you need low-latency responses, no late events are expected, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second and last requirements all rely on ID as the identifier, and there seems to be nothing like a join or a global ordering on other fields such as time. So we can choose ID as the partition key (physical or logical) and atime as the clustering key: for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want the result over all data, or for each ID?
For computing the first three conditions, we need a tool that supports window functions.
Based on these notes, it seems we should choose a tool that has good support for random access queries. Cassandra, Postgres, Druid, MongoDB and ElasticSearch are the ones I can think of right now. Let's check them:
Cassandra: It's great on response time on random access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you should carefully design your data model and it seems it's not a good tool that we can choose (because of future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now, we prefer to avoid adding a new tool to our stack.
Postgres: It's great on random access queries and indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we can have data locality on computations). But there is a problem: ID is not unique; so we can not choose ID as the primary key and we face some problems with random access (We can choose the ID and atime columns (as a timestamp column) as a compound primary key, but it does not save us).
Druid: It's a great OLAP tool. Thanks to the way Druid stores data (segment files), with the right data model you can run analytic queries over a huge volume of data in under a second. It does not support window functions, but with rollup and some other functions (like EARLIEST), we can answer our questions. However, rollup discards the raw data, and we need it.
MongoDB: It supports random access queries and sharding. Also, we can have some type of window function on its computing framework and we can define some sort of pipelines for doing aggregations. It supports capped collections and we can use it to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great at random access, maybe the best. With some kinds of filter aggregations, we can get a sort of window function. It can handle a large amount of data with sharding. But its query language is hard. I imagine we can answer the first and second questions with ES, but I can't come up with the query off the top of my head. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can meet our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.
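Whatever tool ends up being chosen, the data model described above (partition by ID, order by atime within each ID) can be sketched in a few lines of plain Python. This is only an in-memory illustration of the window-style computations from the question, not a recommendation for a specific database. The sample rows and the helper names `partition_by_id` and `last_gap` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows in the question's (id, atime, grade) layout.
rows = [
    (123, datetime(2020, 1, 1, 10, 0), "A"),
    (241, datetime(2020, 1, 1, 10, 5), "B"),
    (123, datetime(2020, 1, 1, 10, 30), "C"),
    (123, datetime(2020, 1, 1, 11, 15), "A"),
]

def partition_by_id(all_rows):
    """Group rows by id and keep each group ordered by atime."""
    groups = defaultdict(list)
    for row in all_rows:
        groups[row[0]].append(row)
    for events in groups.values():
        events.sort(key=lambda r: r[1])
    return groups

def last_gap(events, grade=None):
    """Time between the two most recent events, optionally for one grade only."""
    if grade is not None:
        events = [e for e in events if e[2] == grade]
    if len(events) < 2:
        return None  # not enough rows to compute a difference
    return events[-1][1] - events[-2][1]

groups = partition_by_id(rows)
print(last_gap(groups[123]))       # gap between the last two rows for id 123
print(last_gap(groups[123], "A"))  # same, restricted to grade A
print(groups[123][-10:])           # last 10 records (or fewer) for id 123
```

In a store that supports window functions, the same per-ID ordering would typically be expressed with a partition clause on ID and an ordering on atime.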

Look ahead search on document fields in azure DocumentDb

We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDFs, but this appears to require an entire collection scan, which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted\updated\deleted?
DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the path/property you're using for these queries with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.
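To illustrate why a range index helps with prefix queries, here is a minimal in-memory sketch in Python (it does not use DocumentDB). A sorted list stands in for the range index, and the lower-cased form of each name is stored next to the original, as suggested above. The hotel names and the helper `starts_with` are made up, and the upper bound assumes no stored key contains the character U+FFFF.

```python
from bisect import bisect_left

# Hypothetical hotel names; the index stores the lower-cased form for matching.
names = ["Hilton", "Hyatt", "hilltop Inn", "Holiday Inn", "Marriott", "HILTON Garden"]
index = sorted((name.lower(), name) for name in names)
keys = [key for key, _ in index]

def starts_with(prefix, top=10):
    """Return up to `top` original names whose lower-cased form starts with prefix."""
    p = prefix.lower()
    lo = bisect_left(keys, p)             # first key >= prefix
    hi = bisect_left(keys, p + "\uffff")  # first key past the prefix range (assumption above)
    return [original for _, original in index[lo:hi][:top]]

# Progressive lookups as the user types.
for typed in ["H", "Hi", "Hil", "Hilton"]:
    print(typed, starts_with(typed))
```

Each lookup touches only the contiguous slice of keys that share the prefix, which is the property a range index provides and a full scan lacks.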
I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups; when testing this under load, to avoid saturation and throttling, we found we would have to increase the DocumentDB Request Unit (RU) throughput amount to a point that was not financially viable for us, in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval, a role it performs exceptionally well, while a small ElasticSearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subject of type ahead queries, creation of indexes, autocomplete analyzer and query time 'search as you type' in ElasticSearch can be found here, here and here
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure Marketplace to create relatively small instances. Most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).

Dynamic queries with ArangoDB

I am looking to write dynamic queries for an ArangoDB graph database and am wondering if there are best practices or standard approaches to doing it.
By 'dynamic queries' I mean that users would have the ability to build a query that is then executed on the dataset.
Ways in which ArangoDB could support this include:
Dynamically generate AQL queries by manually injecting bindvars
Write Foxx functions to deliver on supported queries, and have another Foxx function bind those together to build a response.
Write a workflow which extracts data into a temporary collection and then invokes Foxx functions to filter/sort the data to the desired outcome.
The queries would be very open ended, where someone would (for example):
Query all countries with population over 10,000,000
Sort countries by land in square kilometers
Pick the top 10 countries in land coverage
Select primary language spoken in each country
Count occurrences of each language.
That query alone is straightforward to execute, but if a user was able to [x] check or select from a range of supported query options, order them in their own defined way, and receive the output, it's a little more involved.
Are there some supported or recommended approaches to doing this?
My current approach would be to write blocks of AQL that delivered on each part, probably in a LET Q1 = (....), LET Q2 = (...) format, and then finally in the bottom of the query have a generic way of processing the queries to generate a response.
But I have a feeling that smart use of Foxx functions could help here as well, having Foxx-Query-Q1 and Foxx-Query-Q2 coded to support each query type, then an aggregation Foxx app that invoked the right queries in the right order to build the right response.
If anyone has seen best ways of doing this, it would be great to get some hints/advice.
Thanks!
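For reference, the first option (generating AQL with bind variables) could be sketched as below. This is only an illustration in Python that assembles a query string and a bind-variable dictionary; it does not talk to ArangoDB. The collection name `countries`, the attribute names (`population`, `area`, `language`) and the step names are all assumptions. Users choose from an allow-list of step names, and their values travel only as bind variables, never as query text.

```python
# Allow-listed AQL fragments; users pick step names, never raw AQL text.
STEPS = {
    "min_population": "FILTER c.population > @minPopulation",
    "sort_by_area": "SORT c.area DESC",
    "top_n": "LIMIT @topN",
    "count_languages": "COLLECT language = c.language WITH COUNT INTO occurrences",
}

def build_query(selected, values):
    """Assemble an AQL string and its bind variables from the chosen steps."""
    unknown = [name for name in selected if name not in STEPS]
    if unknown:
        raise ValueError(f"unsupported steps: {unknown}")
    lines = ["FOR c IN @@collection"]
    lines += ["  " + STEPS[name] for name in selected]
    if "count_languages" in selected:
        lines.append("  RETURN {language: language, occurrences: occurrences}")
    else:
        lines.append("  RETURN c")
    query = "\n".join(lines)
    bind_vars = {"@collection": "countries"}  # hypothetical collection name
    for key in ("minPopulation", "topN"):
        if ("@" + key) in query:
            bind_vars[key] = values[key]
    return query, bind_vars

query, bind_vars = build_query(
    ["min_population", "sort_by_area", "top_n", "count_languages"],
    {"minPopulation": 10_000_000, "topN": 10},
)
print(query)
print(bind_vars)
```

This sketch does not check that the chosen order of steps is meaningful (for example, sorting on `c.area` after the COLLECT step would not work), so a real implementation would need ordering rules on top of the allow-list.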

Is there any way to skip rows when I retrieve from Azure table storage?

I believe in the past the answer to this question was no. However has anything changed with the recent releases or does anyone know of a way that I can do this. I am using datatables and would love to be able to do something like skip 50 retrieve 50 rows. skip 100 retrieve 50 rows etc.
It is still not possible to skip rows. The only navigation construct supported is top. The Table Service REST API is the definitive way to access Windows Azure Storage, so its documentation is the go-to location for what is or is not possible.
What you're asking here is possible using continuation tokens. Scott Densmore blogged about this a while ago to explain how you can use continuation tokens for paging when you're displaying a table (like what you're asking here with DataTables): Paging with Windows Azure Table Storage. The blog post shows how to display pages of 3 items while using continuation tokens to move forward and back between pages.
Besides that there's also Steve's post that describes the same concept: Paging Over Data in Windows Azure Tables
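The general idea behind token-based paging can be sketched as follows. This is a plain Python simulation, not the Azure storage client: `query_page` is a made-up stand-in for a table query that returns one page plus a token for the next page, and the token format (a stringified offset) is an assumption, since real continuation tokens are opaque. Remembering the token that opened each page is what allows moving back as well as forward.

```python
# In-memory stand-in for a table; a real service returns an opaque token instead.
ROWS = [f"row-{i:03d}" for i in range(8)]

def query_page(page_size, token=None):
    """Return one page of rows plus the token for the next page (None at the end)."""
    start = 0 if token is None else int(token)
    page = ROWS[start:start + page_size]
    next_start = start + len(page)
    next_token = str(next_start) if next_start < len(ROWS) else None
    return page, next_token

def walk_pages(page_size):
    """Collect all pages while remembering the token that opens each one."""
    tokens = [None]  # tokens[i] opens page i; None opens the first page
    pages = []
    while True:
        page, next_token = query_page(page_size, tokens[-1])
        pages.append(page)
        if next_token is None:
            return pages, tokens
        tokens.append(next_token)

pages, tokens = walk_pages(3)
print(pages)
# Going "back" to the second page means replaying the token that opened it.
print(query_page(3, tokens[1])[0])
```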
Yes (kinda) and no. No, in the sense that the Skip operation is not directly supported at the REST head. You could of course do it in memory, but that would defeat the purpose.
However, you can of course actually do this pattern if you structure your data correctly. We do something like this ourselves. We align our partition key to the datetime and use the RowKey as a discriminator. This means we can always pinpoint the partition range we are interested in and then Take() some amount of data. So, for example, we can easily Take() the first 20 rows per hour by specifying a unique query (skipping over data we don't want). The partition key is simply aligned per hour and then we optionally discriminate further using the RowKey - finally, we just take data. When executed in parallel, this works just dandy.
Again, the more technically correct answer is NO. However, you can approximate it cleverly using the PK and RK.
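A rough sketch of the hourly partition key idea, using a plain Python dictionary in place of table storage. The key format, the helper names (`hour_partition_key`, `store`, `take_from_hour`) and the sample entities are assumptions for illustration, not the layout we actually use.

```python
from datetime import datetime

def hour_partition_key(ts: datetime) -> str:
    """Partition key aligned to the hour, e.g. 2013-05-01T14."""
    return ts.strftime("%Y-%m-%dT%H")

def store(table, ts, row_key, payload):
    """Place an entity in its hourly partition under the given RowKey."""
    table.setdefault(hour_partition_key(ts), {})[row_key] = payload

def take_from_hour(table, hour: datetime, count: int):
    """Return the first `count` entities of one hourly partition in RowKey order."""
    partition = table.get(hour_partition_key(hour), {})
    return [partition[key] for key in sorted(partition)[:count]]

table = {}
store(table, datetime(2013, 5, 1, 14, 5), "0001", "first")
store(table, datetime(2013, 5, 1, 14, 20), "0002", "second")
store(table, datetime(2013, 5, 1, 14, 59), "0003", "third")
store(table, datetime(2013, 5, 1, 15, 1), "0001", "next hour")

# "Skip" straight to the 14:00 partition and take two rows from it.
print(take_from_hour(table, datetime(2013, 5, 1, 14), 2))
```

Because each hour is addressed directly by its key, the queries for different hours are independent and can be issued in parallel.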

Can I query any attribute in a Windows Azure Tablestorage row?

Sorry if this sounds like a rather dumb question but I would like to do a "select" on data from a Windows Azure table. I tried the following and it worked:
from status in _statusTable.GetAll()
where status.RowKey.StartsWith(name)
select status
I then tried
from status in _statusTable.GetAll()
where status.Description.StartsWith(name)
select status
This one gave me nothing. Can anyone explain to me if or how I can query on rows that are not part of the RowKey or PartitionKey.
You can query on any property, but the types of query supported are limited - e.g. StartsWith isn't supported. Also, if you aren't querying on PartitionKey and RowKey, then there are some very important performance issues to understand - and you always need to be aware of continuation tokens - almost any query result can contain these.
You can see the sorts of queries supported by looking at the REST API: http://msdn.microsoft.com/en-us/library/dd894031.aspx - it's pretty limited (but quick as a result):
Equal
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
NotEqual
If you need to do more, then:
you can mimic things like StartsWith("Fred") by doing a GreaterThanOrEqual("Fred") and LessThan("Free")
or client side filtering will work - but that means pulling back all the rows from the storage - which could be a lot of data and which could be computationally and transactionally expensive!
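The range trick in the first option can be sketched like this in Python. The helper names are made up, the upper bound is computed by bumping the last character of the prefix (which assumes that character is not the highest possible code point), and the filter string follows the OData-style `ge` / `lt` operators used by the table service.

```python
def prefix_upper_bound(prefix: str) -> str:
    """Exclusive upper bound for all strings starting with `prefix`."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Assumption: the last character is not the highest possible code point.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def prefix_filter(prop: str, prefix: str) -> str:
    """Build a range filter that behaves like StartsWith(prefix) on `prop`."""
    lo = prefix.replace("'", "''")  # single quotes are doubled inside literals
    hi = prefix_upper_bound(prefix).replace("'", "''")
    return f"{prop} ge '{lo}' and {prop} lt '{hi}'"

print(prefix_upper_bound("Fred"))
print(prefix_filter("Description", "Fred"))

# Local check of the same range logic on a few sample values.
samples = ["Fred", "Freddy", "Free", "Frank", "Fred Smith"]
print([s for s in samples if "Fred" <= s < prefix_upper_bound("Fred")])
```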
What does GetAll() do? StartsWith isn't supported by WA tables, so I'm assuming GetAll pulls all the data local, and so your query is done over objects in memory. If so, this has nothing to do with Windows Azure, so I'd take a look at whether your data looks like you expect it to.
