Azure SQL Database Autotuning - Develop without worrying about indexes?

Azure SQL Database's auto-tuning feature creates and drops indexes based on database usage. I imported an old database into Azure that did not have comprehensive indexes defined, and it seems to have done a great job of reducing CPU and DTU usage over a relatively short period of time.
It feels wrong, but does this mean I can develop going forward without defining indexes? Does anyone do this? The SSMS index editor is a pain and slow to work with. Not having to worry about or maintain indexes would speed up development.

The auto-tuning takes advantage of three things: Query Store, missing index suggestions, and machine learning.
First, forcing the last known good plan is a weaponization of the Query Store. This is available in both Azure and SQL Server 2017. It spots queries whose performance has degraded after a plan change (over quite a few executions, not just one) and reverts to the previous plan. If performance degrades after the revert, it turns the forced plan off. This is great. However, if you write crap code, have bad data structures, or have out-of-date statistics, it doesn't help very much.
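The regression check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual logic used by Azure or SQL Server: the function name, the minimum execution count, and the 1.5x ratio are all assumptions made up for the example.

```python
from statistics import mean

def should_force_last_good_plan(old_plan_ms, new_plan_ms,
                                min_executions=15, regression_ratio=1.5):
    """Decide whether to revert to the previous plan.

    old_plan_ms / new_plan_ms: per-execution durations observed under each plan.
    min_executions and regression_ratio are hypothetical thresholds.
    """
    # Require "quite a few executions" of the new plan before judging it.
    if not old_plan_ms or len(new_plan_ms) < min_executions:
        return False
    # Revert only if the new plan is clearly slower on average.
    return mean(new_plan_ms) > regression_ratio * mean(old_plan_ms)
```

A single slow execution never triggers a revert here, which mirrors the "not just one" caveat in the answer.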
The automatic indexing in Azure uses two things: missing index suggestions from the optimizer and machine learning in Azure. If a missing index comes up a lot over a period of 12-18 hours (read this blog post on automating it), you'll get an index suggestion. It then measures for another 12-18 hours: if the index helped, it stays; if not, it goes. This is also great. However, it suffers from two problems. First, same as before, if you have crap code, etc., this will only really help at the margins. Second, the missing index suggestions from the optimizer are not always the best index. For example, when I wrote the blog post, it identified a missing index appropriately, but it missed the fact that adding an INCLUDE column would have been even better than the index it suggested.
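The "measure, then keep or drop" step can be pictured with a small sketch. The metric (average CPU milliseconds per execution) and the 5% improvement threshold are assumptions for illustration only; the real service's criteria are not described in this answer.

```python
from statistics import mean

def keep_index(before_cpu_ms, after_cpu_ms, min_improvement=0.05):
    """Return True if a trial index should stay.

    before_cpu_ms / after_cpu_ms: CPU samples from the observation windows
    before and after the index was created.
    min_improvement: hypothetical threshold (5% by default).
    """
    baseline = mean(before_cpu_ms)
    trial = mean(after_cpu_ms)
    # Keep the index only if the workload got measurably cheaper.
    return trial <= baseline * (1 - min_improvement)
```

Note that a check like this only compares the suggested index against no index; it cannot tell you that a different index (for example, one with an INCLUDE column) would have been better still.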
A human brain and eyeball are still going to be solving the more difficult problems. These automations take away a lot of the easier, low-hanging problems. Overall, that's a great thing. However, don't confuse it with a panacea for all things performance-related. It's absolutely not.

Related

Understanding Cassandra - can it replace RDBMS? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I've spent the last week cramming on Cassandra, trying to understand the basics, as well as if it fits our needs, or not. I think I understand it on a basic level at this point, but if it works like I believe I'm being told...I just can't tell if it's a good fit.
We have a microservices platform which is essentially a large data bus between our customers. They use a set of APIs to push and pull shared data. The filtering, thus far, is pretty simple...but there's no way to know what the future may bring.
On top of this platform is an analytics layer with several visualizations (bar charts, graphs, etc.) based on the data being passed around.
The microservices platform was built atop MySQL with the idea that we could use clustering, which we honestly did not have a lot of luck with. On top of that, changes are painful, as is par for the course in the RDBMS world. Also, we expect extraordinary amounts of data with thousands-upon-thousands of concurrent users - it seems that we'll have an inevitable scaling problem.
So, we began looking at Cassandra as a potential distributed NoSQL replacement.
I watched the DataStax videos, took a course on another site, and started digging in. What I'm finding is:
Data is stored redundantly across several tables, each of which uses different primary and clustering keys, to enable different types of queries, since rows are scattered across different nodes in the cluster
Rather than joining, which isn't supported, you'd denormalize and create "wide" tables with tons of columns
Data is eventually consistent, so new writes may not be readily readable in a predictable, reasonable amount of time.
CQL, while SQL-like, is mostly a lie. How you store and key data determines which types of queries you can use. It seems very limited and inflexible.
While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs. If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?
I want to like this idea and love the distributed features, but frankly am mostly scared off, at this point. I feel like I've learned a lot and nothing at all, in the last week, and am entirely unsure how to proceed.
I looked into JanusGraph, Elassandra, etc. to see if that would provide a simpler interface on top of Cassandra, relegating it to basically a storage engine, but am not confident many of these things are mature enough or even proper, for what we need.
I suppose I'm looking for direction and insight from those of you who have built things w/ Cassandra, to see if it's a fit for what we're doing. I'm out of R&D time, unfortunately. Thanks!
"Understanding Cassandra - can it replace RDBMS?"
The short answer here is "NO." Cassandra is not a simple drop-in replacement for an RDBMS when you suddenly need it to scale.
"While these concepts make sense to me, I'm struggling to see how this would fit most long-term database needs."
It fits long-term database needs if you're applying it to the right use case.
DISCLAIMER: I am a bit of a Cassandra zealot. I've used it for a while, made minor contributions to the project, been named a "Cassandra MVP," and even co-authored a book about it. I think it's a great piece of tech, and you can do amazing things with it.
That being said, there are a lot of things that it's just not good at:
Query flexibility. The tradeoff you make for spreading rows across multiple nodes to meet operational scale is that you have to know your query patterns ahead of time, and then follow them strictly. The idea is that you want each query to be served by a single node, and you'll have to put some thought into your data model to achieve that. Unbound queries (SELECTs without WHERE clauses) become the enemy.
Updating data in-place. Plan on storing values by a key, but then updating them a lot (ex: status)? Cassandra is not a good fit for that. This is because Cassandra has a log-based storage engine which doesn't overwrite anything...it just obsoletes it. So your previous values are still there, and still take up space and compute resources.
Deleting data. Deleting data in the distributed database world is tricky. After all, how do you replicate nothing to another node? Cassandra's answer to that problem is to use a structure called a tombstone. Tombstones take up space, can slow performance, and need to stay around long enough to replicate (making their removal tricky).
Maintaining Data Consistency. Being highly-available and partition tolerant, Cassandra embraces the concept of "eventual consistency." So it should come as no surprise that it really wasn't designed to be consistent. It has a lot of mechanisms which will help keep data consistent, but they are far from perfect. Plus, there really isn't a way to know for sure if your data is in sync or not.
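To make the points about in-place updates and deletes more concrete, here is a toy, single-process sketch of an append-only store in Python. It is not how Cassandra is implemented; the class and method names are invented for illustration. It only shows why repeated updates and deletes leave old records (and tombstones) behind until a compaction step cleans them up.

```python
import itertools

class AppendOnlyStore:
    """Toy log-structured store: writes are appended, never overwritten."""

    def __init__(self):
        self._log = []                   # every record ever written, in order
        self._clock = itertools.count(1)

    def put(self, key, value):
        self._log.append((next(self._clock), key, value, False))

    def delete(self, key):
        # A delete is just another write: a tombstone marker for the key.
        self._log.append((next(self._clock), key, None, True))

    def get(self, key):
        # The newest record for the key wins.
        for _ts, k, value, is_tombstone in reversed(self._log):
            if k == key:
                return None if is_tombstone else value
        return None

    def compact(self):
        """Keep only the newest record per key and drop tombstoned keys."""
        latest = {}
        for record in self._log:
            latest[record[1]] = record
        self._log = [r for r in latest.values() if not r[3]]

    def __len__(self):
        return len(self._log)


store = AppendOnlyStore()
for status in ("NEW", "PAID", "SHIPPED"):
    store.put("order-1", status)         # three records, only one is current
store.delete("order-1")                  # a fourth record: the tombstone
```

In this sketch `compact()` discards tombstones immediately. A real cluster cannot do that, because, as noted above, a tombstone has to stay around long enough to reach the other replicas.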
"If data is redundant across several different tables...how is it managed and kept consistent across those many tables? Are materialized views the answer in this case?"
Materialized views are something that I'd continue to stay away from for the foreseeable future. They're "experimental" for a reason. Basically, once they're out of sync, the only way to get them back in sync is to rebuild them.
I coach my dev teams on keeping their query tables (tables containing the same data, just keyed differently) in sync with BATCH statements. In fact, BATCH is a misnomer, as it probably should have been named "ATOMIC" instead. Because of its name, it is heavily misused, and its misuse can lead to problems. But it does keep mutations applied atomically, so that does help.
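As a rough illustration of that approach, the sketch below builds one CQL BATCH that writes the same row to two query tables. The table and column names (`users_by_id`, `users_by_email`) are hypothetical, and the helper only assembles the statement text and bind values; sending it to a cluster with a driver is left out.

```python
def build_logged_batch(row, tables):
    """Build a CQL BATCH that writes one logical row to several query tables.

    row: dict of column name -> value.
    tables: dict of table name -> list of columns that table stores.
    Returns the statement text and the flat list of bind values.
    """
    statements = []
    params = []
    for table, columns in tables.items():
        cols = ", ".join(columns)
        marks = ", ".join("?" for _ in columns)
        statements.append(f"  INSERT INTO {table} ({cols}) VALUES ({marks});")
        params.extend(row[c] for c in columns)
    cql = "BEGIN BATCH\n" + "\n".join(statements) + "\nAPPLY BATCH;"
    return cql, params


row = {"id": 1, "email": "a@example.com", "name": "Ann"}
tables = {
    "users_by_id": ["id", "email", "name"],      # hypothetical query table
    "users_by_email": ["email", "id", "name"],   # same data, keyed differently
}
cql, params = build_logged_batch(row, tables)
print(cql)
```

The point is that the batch groups the writes for one logical change. It is not a tool for bulk-loading many unrelated rows, which is the misuse the name tends to invite.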
Basically, scrutinize your database requirements. If Cassandra doesn't cut it, then try to find one which does. CockroachDB (or one of the other NewSQLs) might be a better fit for what you're talking about. It tries to be a drop-in for Postgres, and it scales with some Cassandra-like mechanisms, so it might be worth looking into.
Cassandra is very good at what it does but it is not a drop-in replacement for an RDBMS. If you find that you need any of the following, I would not encourage you to migrate to Cassandra:
Strict consistency
ACID transactions
Support for ad-hoc queries, including joins, aggregates, etc.
Now as for you hitting some limits (or thinking you will hit them in the future) with MySQL, here are some thoughts:
Don't think that a limitation in MySQL is a limitation in RDBMS in general. Just so you don't think I am a $some_other_DB zealot, I've been using MySQL for almost 20 years, but it is not the best tool for all jobs.
If by 'changes' you mean 'schema changes', a lot of the pain can be alleviated by either:
Using an RDBMS where they are implemented better (including perhaps a more recent MySQL version)
Using community supported tools such as pt-online-schema-change or gh-ost
Good luck!

Cassandra Data Model for apache access logs

In a POC, we are using Cassandra to store (among other things) parsed Apache access logs, together with Apache Spark + Zeppelin. We have managed to get things working, BUT we are very uncertain about how to model the data correctly.
Edit: Our queries will span months and years rather than weeks and days. Against production, jobs are likely to be executed perhaps daily (at least for now), and we will use a smaller dataset during development.
Since this will be used for analytics ONLY, the queries can be pretty much anything but of course we could consider a handful of queries in advance.
For example:
latency percentiles
geo distribution
sum of requests
popular REST resources
... etc
Partition key + Primary key. This is really difficult... the only thing that I can think of is something like ((userid, [webresource]), timestamp).
At least this would give a fairly even distribution. Otherwise we would have to use a checksum or something which feels wrong.
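One way to sanity-check a candidate partition key before committing to it is to hash a sample of rows and count how many land on each node. The sketch below is only an approximation under stated assumptions: it uses MD5 as a stand-in hash (Cassandra uses its own partitioner), a fixed node count, and made-up column names.

```python
import hashlib
from collections import Counter

def partition_spread(rows, key_columns, nodes=4):
    """Count how many sample rows would land on each of `nodes` buckets."""
    counts = Counter()
    for row in rows:
        key = "|".join(str(row[c]) for c in key_columns)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as a stand-in token.
        counts[int.from_bytes(digest[:8], "big") % nodes] += 1
    return counts


# Hypothetical sample: 1000 requests from 50 users over 20 resources.
sample = [
    {"userid": i % 50, "webresource": f"/api/item/{i % 20}"}
    for i in range(1000)
]
spread = partition_spread(sample, ["userid", "webresource"])
print(dict(spread))
```

Running the same function with a low-cardinality key (for example, only the HTTP method) makes the skew visible right away, since most rows pile up in one or two buckets.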
Or should I have different tables for different types, like latency, geo etc? Or is this a good option for materialized views?
I have googled for something like this without any luck, so perhaps Cassandra is a poor solution for this, BUT still, we would really like to see how far we can get.
Anyway, any input is highly appreciated!
Regards /Johan

Cassandra or mongodb or something else for big online sales site

We currently use MongoDB as the primary store for a big online sales site, and we are focusing on scaling out across multiple machines.
The site backend is written in Node.js and we use Mongoose as the ODM.
I see many blog posts about how awesome Cassandra is, and I am starting to think about switching to it. But I am still not sure this is a good decision, because I haven't found a good ODM/ORM library for Cassandra and Node.js (writing raw queries can be a pain, and writing a good, well-tested ORM/ODM can be time-consuming). So I am not sure how much benefit I would get from the switch. We use Elasticsearch as our search engine, and it works excellently in combination with MongoDB; I am asking myself whether it will do as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database that can handle a lot of scenarios. MongoDB is also a really good DB engine. So let me just compare a couple of main points for you.
Always on system
Cassandra is really great when you need to provide 24x7 operations in multiple data centers. If you have more than one data center with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one data center and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
On the other hand, MongoDB is easy to operate. If you have one data center and only a couple of servers, it might be a perfect fit (although the global write lock might be a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the above statements - Cassandra is linearly scalable. There is, literally, no limit to how big the cluster can be. Your writes will always stay fast, while reads might become more complicated over time, depending on the structure of your data.
Denormalization of data
With Cassandra your writes and reads can be extremely fast if you create a structure that reflects what you need to get from your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set using aggregates, groupings, etc. Some things are doable and some are not; that is very specific to the Cassandra data model. You will have to implement a lot of things on your own and write the results to the DB, e.g. counters for aggregation, different groupings, etc.
In comparison, MongoDB is easy to use, easier to learn, and more flexible, both for development (as far as the learning curve and effort go) and for implementing business logic (as far as time and effort are concerned). That is, kind of, a reason why there are ORM engines for MongoDB and only a couple (very limited) for Cassandra.
To summarize - both DBs are really good... if you embrace their limitations. If you have only 100GB of data and you need a flexible, easy-to-implement DB engine, I would stick with MongoDB; alternatively, take a look at RethinkDB, which has a very similar model and a way better (in my personal opinion) clustering/data center replication implementation.
Cassandra is a great option for you if you will need to store TBs of data soon and deploy your apps across multiple data centers, while accepting the cost of the additional effort to implement the same features and maintain similar capabilities.
Don't take it personally that I have used the word only while describing your data set. Yes, it's not big - my company stores more than 20 TB these days... so yeah, 100GB is really not that much...
To stop everyone from pointing out that I should compare some other features or point out some other differences between those two - it's just a rough, high-level overview of the things I consider relevant to the problem, not a full comparison or analysis. But feel free to point out what I have missed and I will be happy to include new stuff in this answer...

MongoDb for collection of production data

I am facing a new type of problem that I haven't tried tackling before. So I would like some pointers in the right direction from someone more knowledgeable than I am :-)
I have been asked by a friend to help him design a control system for a production line. The project sounds really interesting, and I can't stop thinking about it.
I have already found that I can control the system using a node.js server. So far so good (HTML5 interface here we come)! But where I really want this system to stand out is in the collection of system metrics. The system reports all kinds of things such as temperature, flow etc, and these metrics are reported up to several hundred times per second per metric... and this runs 24/7.
My thought is to persist this in a MongoDb database, and do some realtime statistics on this. The "competition", if you will, seems to save this in a SQL server database and allow the operators to export aggregated data to Excel, and do statistics in Excel.
What are the strategies for doing real-time statistics using MongoDB?
I would really like to provide instant feedback and monitoring based on these metrics. Such as average temperature over the last 24 hours, spikes etc, and also enable alerts. There will not be much advanced statistics done on the server. If that is needed, I would enable export of data to a program such as SPSS.
Is MongoDb a good fit for that? I would love to use a Linux machine instead of a Windows machine with SQL Server and a WinForms Control Interface. The license fees alone are enough to put me off, although I know it probably isn't the case for the people buying the machinery.
This will not be placed in the cloud, but rather on a single server on the network. Next to the machine being operated, I will place a touch interface that through a browser will contact the node.js server to invoke PLC commands. There can be multiple machines that need controlling, and they would all be controlled by the same central node.js server.
The machinery is controlled by PLC controllers from http://beckhoff.com/.
I am not a complete novice when it comes to MongoDb, but I have never put anything I have made into production, and I wouldn't put MongoDb on my CV... yet!
EDIT: It seems that the $inc operator is the way to go. But what if I want both the daily and hourly averages, as well as a continuous feed that updates an on-screen chart every second using socket.io? Is it a good idea to update a document for each of the aggregates I need? I would also really like to save every measurement, but maybe I could aggregate on a per-second basis so that I don't store up to 1000 records per second per metric.
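A minimal sketch of the $inc idea, assuming one document per metric per time bucket: each measurement increments a running `count` and `sum` in both the hourly and the daily bucket, and the average is computed as `sum / count` when read. The field names are hypothetical, and `apply_in_memory` is only a stand-in so the sketch runs without a server; with a real database the same filter and update pairs would be sent as upserts.

```python
from datetime import datetime

def bucket_updates(metric, value, ts):
    """Return (filter, update) pairs for the hourly and daily buckets."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    day = hour.replace(hour=0)
    inc = {"$inc": {"count": 1, "sum": value}}
    return [
        ({"metric": metric, "bucket": "hour", "start": hour}, inc),
        ({"metric": metric, "bucket": "day", "start": day}, inc),
    ]

def apply_in_memory(store, filter_doc, update_doc):
    """Stand-in for an upsert with $inc, so the sketch runs without a server."""
    key = tuple(sorted(filter_doc.items()))
    doc = store.setdefault(key, {"count": 0, "sum": 0.0})
    for field, amount in update_doc["$inc"].items():
        doc[field] += amount
    return doc


store = {}
readings = [(20.0, 1), (22.0, 30), (24.0, 59)]   # (temperature, minute)
for value, minute in readings:
    ts = datetime(2020, 1, 1, 10, minute)
    for filter_doc, update_doc in bucket_updates("temperature", value, ts):
        apply_in_memory(store, filter_doc, update_doc)

for doc in store.values():
    print(doc["sum"] / doc["count"])             # average per bucket
```

The number of documents grows with the number of buckets rather than with the number of measurements, which is what makes this attractive at several hundred readings per second.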
MongoDB can definitely be used for your scenario. Look at http://www.slideshare.net/pstokes2/social-analytics-with-mongodb, http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ or the question "Real-time statistics: MySQL(/Drizzle) or MongoDB?" for more on this topic.
What I am really looking for is the Aggregation Framework: http://docs.mongodb.org/manual/tutorial/aggregation-examples/
That gives me exactly the kind of stats that I would like to see. Use this to calculate sums and averages as I write, and then also allow for ad-hoc queries should they be needed.
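For example, an "average and peak over the last 24 hours" query could be expressed as a two-stage pipeline like the one below. This is a sketch under assumptions: the field names (`metric`, `ts`, `value`) are invented, and the function only builds the pipeline as plain Python data; it would then be passed to the driver's aggregate call on the measurements collection.

```python
from datetime import datetime, timedelta

def window_stats_pipeline(metric, now, window_hours=24):
    """Build an aggregation pipeline for average and max over a time window."""
    since = now - timedelta(hours=window_hours)
    return [
        # Stage 1: keep only this metric's measurements inside the window.
        {"$match": {"metric": metric, "ts": {"$gte": since}}},
        # Stage 2: collapse them into one document with the statistics.
        {"$group": {
            "_id": "$metric",
            "avg": {"$avg": "$value"},
            "max": {"$max": "$value"},
        }},
    ]


pipeline = window_stats_pipeline("temperature", datetime(2020, 1, 2, 12, 0))
print(pipeline)
```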
For a little insight on performance, read this awesome blogpost!
http://devsmash.com/blog/mongodb-ad-hoc-analytics-aggregation-framework
Also, anyone else looking to do something like this should take a look at this to see how to save the individual events. I don't need to save data longer than a week for example, so a rolling log should be more than enough for me: http://blog.mongodb.org/post/172254834/mongodb-is-fantastic-for-logging
With this I am very close to having a really sweet setup, and I am beginning to feel confident that this is a good choice over MySQL or MSSQL.

Is NoSQL ideal to store stats?

I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal to handle statistical data.
Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL if it fits the bill.
If NoSQL is indeed ideal for this, could you provide me with some information as to why? And which specific systems are best suited for this particular need?
So, after the first answer, maybe it's helpful to clarify a bit more.
I currently have a PostgreSQL database from which I'll get the data. It will be very simple, and no calculations needed. For example, I'll just get a resultset with the amount of users registered each day for the past month (so it'll basically just be a set of value pairs for the date/users) and save that in another table/database.
Thanks!
It kind of depends on what sorts of analysis you are going to be doing on these stats. If you are going to be doing a lot of different operations (averaging, summing, joining...), you may find NoSQL solutions to be more of a pain than they are worth.
However, if you are storing stats mostly for a display purpose, or for very specific analysis routines, NoSQL solutions start to shine.
If your data is small enough, stick with a SQL solution, which will give the benefit of a full query engine to work with, but if you have lots of values (one value a day is nothing, even if you were running for a million years), and are worried about storage size and performance, NoSQL options once again may be worth it.
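For the use case in the question (users registered per day), the SQL route is a single GROUP BY. The sketch below uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; the asker's data lives in PostgreSQL, where the date function would differ, and the `users` table and `registered_at` column are assumed names.

```python
import sqlite3

def daily_signups(conn):
    """Return (day, count) pairs, one per day with at least one signup."""
    return conn.execute(
        """
        SELECT date(registered_at) AS day, COUNT(*) AS signups
        FROM users
        GROUP BY date(registered_at)
        ORDER BY day
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, registered_at TEXT)")
conn.executemany(
    "INSERT INTO users (registered_at) VALUES (?)",
    [("2020-01-01 09:00:00",), ("2020-01-01 17:30:00",), ("2020-01-02 08:15:00",)],
)
rows = daily_signups(conn)
print(rows)
```

At one row per day, storing the result set in another table of the same database is also trivially cheap.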
If your data is semi-structured, take a look at CouchDB, which offers some rudimentary indexing and querying support, which could provide some basis for analysis routines. If you are storing individual values with very little structure, my best advice would be to take a look at Tokyo Cabinet and Tokyo Tyrant, which are absolutely incredible options for key-value storage.
NoSQL systems tend to optimize for the case where data is stored frequently but accessed infrequently. In the case of statistics, you might gather lots of data from a (social) site frequently in small bits, which is the case these systems are optimized for. But retrieval and analysis might be slower... It of course depends on which "NoSQL" system you decide to use.

Resources