I am looking to build out a realtime pubsub database backend. RethinkDB is actually a perfect package for what I need, mainly because of it's very low latency changefeeds. But RethinkDB seems to be a DB that you can expect about 10k-20k inserts per second on two machines. Whereas I have seen some postings claim people are getting 1 million inserts per second on DB's like Cassandra with comparable hardware, but Cassandra doesn't have the realtime changefeeds feature.
So my question is, is there another DB, or combination of open source systems, which can provide the low latency changefeed functionality of RethinkDB, but enable it to occur on a scale much much larger than RethinkDB? Both quantity of inserts per second, and amount of users that are subscribed to change feeds are both important requirements that need to be high as possible.
RethinkDB might still fit your needs if you can scale out to a robust cluster (lots of nodes). Below is a link to a report they generated with performance metrics scaling up to a 16-node cluster.
https://rethinkdb.com/docs/2-1-5-performance-report/
Related
Has anyone had any experience with database partitioning? We already have a lot of data and queries on it are already starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters, see the relevant documentation. You will have to set up a number read-only nodes first. The load of reading data will then be distributed among the read-only nodes while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that compliments the setup.
Besides using a DB cluster you can also try to reduce the load of the db server.
The first thing you should enable the HTTP-Cache, still better to additionaly also set up a reverse cache like varnish. This will greatly decrease the number of requests that hit your webserver and thus your DB server as well.
Besides all those measures explained here should improve the overall performance of your shop as well as decreasing load on the DB.
Additionally you could use Elasticsearch, so that costly search requests won't hit the Database. And use a "real" MessageQueue, so that the messages are not stored in the Database. And use Redis instead of the database for the storage of performance critical information as is documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup, so maybe you see in the DB locks something that hints to one of the points i mentioned previously, so that would be an indicator to start in that direction. E.g. if you see a lot of search related queries Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the MessageQueue might be a better starting point.
All in all when you use a DB cluster with a primary and multiple replicas and use the additional services i mentioned here your shop should be able to scale quite well without the need for partitioning the actual DB.
I am actually planing our next version of our telemetry system architecture. I am strongly considering Pulsar at the messaging solution.
To better understand what's this technology is best for, can someone share their use cases of why their use the infinite retention of a topic other than audit trail ?
I was main goal is to see if our telemetry data could be simply stored in a pulsar topic and query that for analytics purpose instead of using a time series database like Apache Druid.
Thanks !
The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. For higher-volume queries or queries with strict latency requirements, this is generally pretty unsuited for that sort of workload, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in pulsar might not be that bad, especially if you'd be using pulsar already to feed data into the time-series store (as you can then dispense with the time-series store).
we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think that we will use node.js, cause we've read than its faster than Django for this kind of tasks. But we are not sure and we are open to ideas.
But about Mongo or Cassandra, we are so confused. Our data is mostly structured, so store it in tables like Cassandra would make it easy to manage, also Cassandra seems to have better performance. However, we also have IoT devices data, with lots of real-time GPS location...
Which suggestions can you give to us to achieve our goal?
TL;DR Summary;
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is so important. Its more important to read data in real time, than keeping it uncorrupted/secure.
Most calculus/math will be calculated in the client's browser, the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
JaguarDB has very strong support of geospatial data (2D and 3D). It allows you to store multi-measurements per point location while other databases support only one measurement (pointm). Many complex queries such as Voronoi polygon, convexhull are also supported. It is open source, distributed and sharded, multiple columns indexes, etc.
Concerning Postgresql and Cassandra, is there much difference in RAM/CPU/DISK usage between them?
Our use case does not require transactions, it will be in a single node and we will have IoT devices writing data up to 500 times per second. However ive read that Geographical data that works better with Potstgis than cassandra...
According to this use case, do you recommend Cassandra or Postgis?
The new new Azure SQL Database Services look good. However I am trying to work out how scalable they really are.
So, for example, assume a 200 concurrent user system.
For Standard
Workgroup and cloud applications with "multiple" concurrent transactions
For Premium
Mission-critical, high transactional volume with "many" concurrent users
What does "Multiple" and "Many" mean?
Also Standard/S1 offers 15 DTUs while Standard/S2 offers 50 DTUs. What does this mean?
Going back to my 200 user example, what option should I be going for?
Azure SQL Database Link
Thanks
EDIT
Useful page on definitions
However what is "max sessions"? Is this the number of concurrent connections?
There are some great MSDN articles on Azure SQL Database, this one in particular has a great starting point for DTUs. http://msdn.microsoft.com/en-us/library/azure/dn741336.aspx and http://channel9.msdn.com/Series/Windows-Azure-Storage-SQL-Database-Tutorials/Scott-Klein-Video-02
In short, it's a way to understand the resources powering each performance level. One of the things we know when talking with Azure SQL Database customers, is that they are a varied group. Some are most comfortable with the most absolute details, cores, memory, IOPS - and others are after a much more summarized level of information. There is no one-size fits all. DTU is meant for this later group.
Regardless, one of the benefits of the cloud is that it's easy to start with one service tier and performance level and iterate. In Azure SQL Database specifically you can change the performance level while you're application is up. During the change there is typically less than a second of elapsed time when DB connections are dropped. The internal workflow in our service for moving a DB from service tier/performance level follows the same pattern as the workflow for failing over nodes in our data centers. And nodes failing over happens all the time independent of service tier changes. In other words, you shouldn’t notice any difference in this regard relative to your past experience.
If DTU's aren't your thing, we also have a more detailed benchmark workload that may appeal. http://msdn.microsoft.com/en-us/library/azure/dn741327.aspx
Thanks Guy
It is really hard to tell without doing a test. By 200 users I assume you mean 200 people sitting at their computer at the same time doing stuff, not 200 users who log on twice a day. S2 allows 49 transactions per second which sounds about right, but you need to test. Also doing a lot of caching can't hurt.
Check out the new Elastic DB offering (Preview) announced at Build today. The pricing page has been updated with Elastic DB price information.
DTUs are based on a blended measure of CPU, memory, reads, and writes. As DTUs increase, the power offered by the performance level increases. Azure has different limits on the concurrent connections, memory, IO and CPU usage. Which tier one has to pick really depends upon
#concurrent users
Log rate
IO rate
CPU usage
Database size
For example, if you are designing a system where multiple users are reading and there are only a few writers, and if your application middle tier can cache the data as much as possible and only selective queries / application restart hit the database then you may not worry too much about the IO and CPU usage.
If many users are hitting the database at the same time, you may hit the concurrent connection limit and requests will be throttled. If you can control user requests coming to the database in your application then this shouldn't be a problem.
Log rate: Depends upon the volume of the data changes (including additional data pumping in the system). I have seen application steadily pumping the data vs data being pumped all at once. Selecting the right DTU again depends upon how one can do throttling at the application end and get steady rate.
Database size: Basic, standard, and premium has different allowed max sizes, and this is another deciding factor. Using table compression kind of features helps reducing the total size, and hence total IO.
Memory: Tuning the expesnive queries (joins, sorts etc), enabling lock escalation / nolock scans help controlling the memory usage.
The very common mistake people usually do in database systems is scaling up their database instead of tuning the queries and application logic. So testing, monitoring the resources / queries with different DTU limits is the best way of dealing this.
If choose the wrong DTU, don't worry you can always scale up/ down in SQL DB and it is completely online operation
Also unless a strong reason migrate to V12 to get even better performance and features.
I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of the Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is nigh well impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection which allows you scale-out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage, and an variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution; and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now, specify a partition key and DocumentDB now handles the partitioning for you e.g. If you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html