Graph traversal performance of Azure Cosmos DB

Cosmos DB allows us to store graph data using the Gremlin query language.
Are there intelligent algorithms that optimize how the graph is split up among many servers? If not, I can imagine some queries being extremely slow due to network latency between the shards.

The documentation is still a bit lacking, but there are some performance considerations for DocumentDB itself. Namely, setting up a partition key that is adequately granular will split your data across multiple partitions, giving you higher throughput. You can find more here:
https://learn.microsoft.com/en-us/azure/documentdb/documentdb-partition-data
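For illustration, here is a minimal sketch (not from the answer) of submitting a Gremlin traversal to a Cosmos DB graph with the gremlinpython driver. The account, database/graph names, and the partition-key property "pk" are hypothetical placeholders; the point is that traversals which filter on the partition-key property early can be served from a single partition, while unfiltered traversals may fan out across partitions.

```python
# Minimal sketch, assuming a Cosmos DB Gremlin account and a graph whose
# partition-key property is "pk" (placeholder names throughout).
from gremlin_python.driver import client, serializer

gremlin_client = client.Client(
    "wss://<your-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Filtering on the partition-key property keeps the traversal on one partition;
# dropping the has('pk', ...) step would force a cross-partition fan-out.
query = "g.V().has('pk', 'region-1').has('name', 'alice').out('knows').values('name')"
results = gremlin_client.submit(query).all().result()
print(results)

gremlin_client.close()
```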

Related

Does Cosmos DB store data differently depending on which API you use?

What are the ramifications of using each API?
For example, if I am using the SQL API, am I sacrificing ACID, and which part of CAP am I getting? How does Azure achieve horizontal scalability?
If I am using the document API or the key-value API, is the internal data layout different?
The internal storage format for the various database APIs in Cosmos DB doesn't have any bearing on ACID or CAP. The API you choose should be driven by the appropriate use case and/or your familiarity with it. For instance, both the SQL API and the MongoDB API are document databases. But if you have experience using Mongo, then it's probably a better choice than, say, Gremlin (a graph database), Table (a key/value store), or Cassandra (a columnar store).
Cosmos DB provides ACID support for data within the same logical partition. With regard to CAP, that depends on which consistency level/model you choose. Strong consistency trades availability ("A") for consistency ("C"). All the other consistency models trade consistency for availability, but to varying degrees. For instance, bounded staleness defines an upper bound by which data can lag in consistency, measured in time or in updates. As that boundary approaches, Cosmos will throttle new writes to allow the replication queue to catch up, ensuring the consistency guarantees are met.
Achieving horizontal scalability is a function of your partitioning strategy. For write-heavy workloads, the objective should be to choose a partition key that allows writes to be distributed across a wide range of values. For read-heavy workloads, you want queries to be served by one partition or a bounded number of partitions. If your workload is both, then use the change feed to copy data into two or more containers as needed, such that the cost of copying the data is cheaper than running the queries against a container where they would be cross-partition. At the end of the day, choosing a good partition key requires testing under production levels of load and data.
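A minimal sketch of the two knobs discussed in this answer, using the azure-cosmos Python SDK: the account-level consistency model and the container's partition key. The database, container, and partition key names are assumptions for illustration, not something prescribed by the answer.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Consistency level is one of the CAP trade-offs discussed above
# (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual).
client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    credential="<primary-key>",
    consistency_level="Session",
)

database = client.create_database_if_not_exists(id="ordersdb")

# /customerId is a hypothetical partition key: writes spread across many
# customer values, while most reads target a single customer's partition.
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=400,
)

container.upsert_item({
    "id": "order-1001",
    "customerId": "c-42",   # partition key value
    "total": 99.50,
})
```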

Is Star Schema (data modelling) still relevant with the Lake House pattern using Databricks?

The more I read about the Lake House architectural pattern and follow the demos from Databricks, the less discussion I see around dimensional modelling as in a traditional data warehouse (Kimball approach). I understand the compute and storage are much cheaper, but are there any bigger impacts in terms of query performance without the data modelling? In Spark 3.0 onwards I see all the cool features like Adaptive Query Execution, Dynamic Partition Pruning, etc., but is dimensional modelling becoming obsolete because of that? If anyone has implemented dimensional modelling with Databricks, please share your thoughts.
Not really a question for here, but interesting.
Of course Databricks et al are selling their Cloud solutions - I'm fine with that.
Taking this video https://go.incorta.com/recording-death-of-the-star-schema into account - whether paid for or the real opinion of Imhoff:
The computing power is higher at lower cost, if you manage it, and you can do more things on the fly.
That said, the same could be said of SAP HANA, where you do ETL on the fly. I am not sure why I would want to create a type 2 dimension virtually, on the fly, every single time.
Star schemas require thought and maintenance, but show focus. Performance is less of an issue.
It is true that ad hoc queries do not work well with star schemas over multiple fact tables. Try it.
Databricks has issues with sharing clusters with Scala; if you do it their way with PySpark, it is OK.
It remains to be seen if querying via Tableau works well on Delta Lake - I need to see it for myself. In the past we had thrift server etc. for this and it did not work, but things are different now.
Where I am now, we have a Data Lake on HDP with Delta format, and a dimensional SQL Server DWH. The latter is due to the on-premises aspects of HDP.
Not having star schemas means people need more skills to query.
If it were just ad hoc querying, I would elect the Lakehouse, but actually I think you need both. It's akin to the discussion of whether you need ETL tools if you have Spark.
Kimball's star schema and Data Vault modelling techniques are still relevant for Lakehouse patterns, and the optimizations mentioned, like Adaptive Query Execution and Dynamic Partition Pruning, combined with data skipping, Z-Ordering, Bloom filters, etc., make queries very efficient.
In fact, Databricks data warehousing specialists recently published two related blog posts:
Data Warehousing Modeling Techniques and Their Implementation on the Databricks Lakehouse Platform: Using Data Vaults and Star Schemas on the Lakehouse
Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform
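To make those optimizations concrete, here is a hedged sketch of how they are typically applied from a Databricks notebook. The table and column names are hypothetical; OPTIMIZE ... ZORDER BY is Databricks/Delta Lake functionality, and the two Spark configs shown are already enabled by default on recent runtimes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Co-locate rows with similar key values in the same files so that data
# skipping can prune files when queries filter on customer_sk.
spark.sql("OPTIMIZE sales.fact_sales ZORDER BY (customer_sk)")

# Adaptive Query Execution and Dynamic Partition Pruning (Spark 3.0+);
# shown here only to make the knobs explicit.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```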
In our use case we access the lakehouse using Power BI + Spark SQL, and being able to significantly reduce the data volume the queries return by using the star schema makes the experience faster for the end user and saves compute resources.
However, considering things like the columnar nature of Parquet files and partition pruning, which both also decrease the data volume per query, I can imagine scenarios in which a reasonable setup without a star schema could work.
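As an illustration of the point above, here is a minimal, assumed sketch of a star-schema style query on a Delta lakehouse (table and column names are hypothetical): a selective filter on a small dimension bounds the fact-table scan and shrinks the result set handed back to the BI tool.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact = spark.table("sales.fact_sales")             # Delta fact table (hypothetical)
dim_customer = spark.table("sales.dim_customer")   # Delta dimension table (hypothetical)

# The dimension filter lets partition pruning / data skipping limit the fact scan,
# and the aggregation returns only a small summary to the client.
result = (
    fact.join(dim_customer, "customer_sk")
        .where(dim_customer.segment == "Enterprise")
        .groupBy(dim_customer.country)
        .sum("sales_amount")
)

result.show()
```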

Azure SQL Data Warehouse DWU vs Azure SQL DTU

I am considering migration from Azure SQL to Azure SQL Data Warehouse. It seems to offer some of the features that we need, however price is a concern for starting small. 100 DWU Data Warehouse is priced considerably higher ($521/month) than a seemingly comparable 100 DTU Azure SQL S2 tier ($150/month).
To make sure I am comparing apples to apples, can someone shed some light on how DWU compare to DTU (assuming basic configuration with a single database)?
Edit: to everyone who is inclined to answer that Azure SQL DW and Azure SQL are not comparable and that it therefore makes no sense to compare DTU to DWU: then how does it make sense to talk about migrating to DW?
For what it's worth, 1 DWU = 7.5 DTU with respect to server capacity
When you look at the server instance that you provision a DW instance on:
100 DWU instance consumes 750 DTUs of server capacity
400 DWU instance consumes 3,000 DTUs of server capacity
While this information is interesting, it may not be very useful in terms of comparing pricing because DW pricing is exclusively based on DWU, while Azure SQL pricing is the combination of DTU and database size.
You can't, and really shouldn't, compare the two for the same workload; they're designed for different things based on completely different architectures. As such, DTU and DWU are not comparable measures. Also, how deeply have you looked into the technical differences? The high-level features are not the major issue; the details are what might wreck your app (e.g. can you live with a limited T-SQL surface area or transaction isolation level?)
Azure SQL DB is intended to be a general-purpose DB as a service. The few feature gaps aside, you should think about Azure SQL DB functionally the same way you do SQL Server, minus a lot of the administrative tasks and with a different programming model. It works great for OLTP apps and most reporting apps (or mixed), but not so great for complex analytical apps against very large datasets (you can't really store that much in SQL DB anyway).
SQL DW is intended for data warehousing, analytical type workloads. Its MPP architecture is particularly well suited for complex queries against very large data sets. It will not perform well for typical OLTP applications that have lots of small or singleton queries especially when it's a mix of insert, update and delete operations. If you get a trial instance of SQL DW, you can easily test and verify the behavior for your workload compared to what it currently looks like on SQL DB.
SQL DW also has some limitations on its TSQL surface area, types, concurrency, isolation levels (deal breaker for almost all OLTP apps), etc... so be sure to look into the documentation to get the whole picture as you evaluate feasibility. It might work great but I suspect it's not the best solution if you're running an OLTP workload. Reporting/analytical type workloads however might find a happy home in SQL DW.
The best way to figure out what you need is to look at your current IO requirements. Data warehouses tend to be IO hogs and consequently are optimized by maximizing IO throughput. The DWU Calculator site walks you through the process of capturing your disk metrics and estimates how many DWUs you need to fulfill your workload.
http://dwucalculator.azurewebsites.net/

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap-but-awful, highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh-on impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and is the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now specify a partition key and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html
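A small sketch (assumed database, container, and property names) of the two query paths described above, using the current azure-cosmos Python SDK: a single-partition read when the partition key (here the document's date) is supplied, and a fan-out query across partitions when it is not.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<primary-key>")
container = client.get_database_client("logsdb").get_container_client("logs")

# Single-partition query: supplying the partition key pins it to one partition.
daily = container.query_items(
    query="SELECT * FROM c WHERE c.date = '2016-04-01'",
    partition_key="2016-04-01",
)

# Cross-partition (fan-out) query: the SDK fans the query out across partitions.
errors = container.query_items(
    query="SELECT * FROM c WHERE c.level = 'ERROR'",
    enable_cross_partition_query=True,
)

for item in errors:
    print(item["id"])
```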

Key differences between Azure DocumentDB and Azure Table Storage

I am choosing database technology for my new project. I am wondering what are the key differences between Azure DocumentDB and Azure Table Storage?
It seems that the main advantage of DocumentDB is full-text search and rich query functionality. If I understand it correctly, I would not need a separate search engine library such as Lucene/Elasticsearch.
On the other hand, Table Storage is much cheaper.
What are the other differences that could influence my decision?
I consider Azure Search an alternative to Lucene. I used Lucene.NET in a worker role, and simply the idea of not having to deal with the infrastructure, ingestion, etc. makes the Azure Search service very appealing to me.
There is a scenario I approached with Azure Storage in which I see DocumentDB as a perfect fit, and it might explain my point of view.
I used Azure Storage to prepare and keep daily summaries of the user activities in my solution outside of Azure SQL Database, as the summaries are requested frequently by a large number of clients, with a good chance of spikes at certain times of the day. It is a simple write-once-read-many usage pattern that (with my schema) Azure SQL DB found difficult to cope with, while it fit the capacity of storage perfectly (by the way, the daily summaries were not in cache because of their size).
This scenario evolved over time and now I happen to keep more aggregated and ready to use data in those summaries, and updates became more complex.
Keeping these daily summaries in DocumentDB would make the write-once part of the scenario more granular, updating only the relevant data in the complex summary, and would ease the read part, as retrieving parts of several summaries becomes a trivial task, for example.
I would consider DocumentDB in scenarios in which data is unstructured and rather complex and I need rich query capability (Table storage is lagging on this part).
I would consider Azure Search in scenarios in which a high throughput full-text search is required.
I did not find the quotas/expected performance figures to precisely compare DocumentDB to Search, but I highly suspect Search is the best fit to replace Lucene.
HTH, Davide
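To make the "rich query vs. key/value" distinction concrete, here is a hedged comparison sketch using the current Python SDKs (table, container, and property names are assumptions): Table Storage is excellent at point lookups on PartitionKey/RowKey, while DocumentDB lets you query arbitrary properties of the JSON document with SQL-like syntax.

```python
from azure.data.tables import TableClient
from azure.cosmos import CosmosClient

# Table Storage: fast point lookup by PartitionKey + RowKey,
# but limited filtering capabilities beyond that.
table = TableClient.from_connection_string("<connection-string>", table_name="UserSummaries")
entity = table.get_entity(partition_key="2016-03-01", row_key="user-42")

# DocumentDB: rich queries over any property of the document.
cosmos = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<primary-key>")
container = cosmos.get_database_client("appdb").get_container_client("summaries")
items = container.query_items(
    query="SELECT c.userId, c.total FROM c WHERE c.total > 100 AND c.region = 'EU'",
    enable_cross_partition_query=True,
)
for item in items:
    print(item)
```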

Resources