Optimizing MongoDB for reads [closed] - multithreading

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I'm using MongoDB as a read only document source, used for computing statistics. Each document has no subdocuments, but the database has approximately ~900k documents and will grow by ~ 1k documents each day, added at a time where the database will be idle.
So, I'd like to understand the following things:
I've read that MongoDB works best when the entire collection is stored in RAM. Assuming my database is ~400MB and our server can easily cram the whole thing into RAM, is there a way I can tell MongoDB to pre-load my entire collection into RAM?
I've also read that there are cases where creating replica sets will help with the read performance of the database. Is my scenario one of the cases where this will help?
I'm threading my statistical calculations, but notice that the amount of time to complete the queries I run against mongoDB when doing these calculations triples when I thread them as opposed to running them synchronously. Is there anything I can do to improve the performance of the DB when I'm making requests against the same collection simultaneously?

No, MongoDB DOES NOT WORK BEST when the collection is in RAM. I have no idea who told you that but it is a common mis-conception about how MongoDB works.
MongoDB works best when it can not only fit your working set into RAM ( What does it mean to fit "working set" into RAM for MongoDB? ) but also load it in RAM at significantly great speed. One thing that can help the speed of paging in your working set is the size of your documents.
This is one reason why MongoDB is limited to 16MB, it has been found that sizes greater start to have a seriously detremental performance impact. Basically you spend too much time loading your data from the disk, this is one reason for de-normalisation by logically splitting tables in SQL techs; to make them faster to load.
This means you may have to optimise both the size of the value and the size of the field name to match performance needs for your reads. You will of course also have to match hardware.
Replica sets are not actually designed to help with read performance, they are designed to give your data high availability by making automated failover. The topic you read suggests getting stale reads from secondaries. This, as has been proven (edit: since proven is a strong word and this is scenario based I'm going to say "found") recently, can actually be less performant than using PrimaryPreferred read preference.
As for improving performance we would need stats from you on page faults, IO bottlenecks and general mongostat and top.

About Point 1:
You can use the touch command to persuade the database to load a collection into memory. But keep in mind that this isn't permanent. When you don't access the cached documents soon, they will get uncached in favor of more frequently-used documents.
About Point 2 and 3:
Replica-sets are a good way to improve the performance of parallel read operations. Each server of a replica-set mirrors the whole data and can respond to any query on its own without having to contact the other servers. That means when you double the number of servers in your replica-set, you also double the performance of simultaneous queries.
Keep in mind that the read preferences you set on your connection might prevent it from using more than one server.
Alternatively you can build a sharded cluster, but this is technically a lot more complex than a replica-set and won't improve read-performance much when your queries don't match the shard-key of the collection or when you selected your shard-key in a way that the requests aren't evenly distributed between the shards.

Related

Aerospike over in-memory Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
I'm a bit of a noob with noSQL & cannot decide for Aerospike over complete in-memory Cassandra.
Use Case:
To be used for multiple services in our University ( From social platform to internal financial analytics to network logging to real-time messaging). Our daily active users are also constant(~5000). So my primary requirement is not to get 1M+ TPS but to reduce latency and maintain consistency serving the user data as fast as possible. The DB would be running on 3 bare metal servers with 32-vcore 128GB-Ram 256GB-SSD each connected in 10Gbit. The data won't be exceeding Ram as most of the data will be archived(to another ElacticSearch Server) every 6 Months.
Also, I don't mind to take the challenge and do a bit over-engineering & it's fine if the Cluster is hard to set-up but it should require little or no maintenance for years.
So looking over in-memory DB's Aerospike seemed a great choice. Then I was very exited to go blazingly fast but then I looked at Aerospike total garbage? & We use Aerospike heavily. It works just fine. Now, this got me thinking it this the best fit for me?
Or should I go for complete in-memory Cassandra which is not optimised for complete in-memory table & still is less performant than Aerospike but has a better data model fit for me, does not have consistency issues and is tried & tested.( I am intrigued by ScyllaDB but it doesn't have in-memory tables)
I would like to have answers from people with production experience with Aerospike & Cassandra. Also please tell me if I am completely wrong.
My first point is that this isn't a valid Stackoverflow question. When you click on Ask Question the How to Ask block states Is your question about programming?
The Medium article is poorly written opinion from a faceless user, without data to back up the claims. Yes, Aerospike has bugs, as do all databases. GCE itself has bugs that can affect a distributed database such as Aerospike. I haven't seen any issue in the aerospike/aerospike-server repo on GitHub talking about this user's problems on GCE. Usually people who use a software product in production will report a bug that affects them severely. The lack of a bug report is a "bad smell" - is it FUD?
Aerospike is in fact used for high performance at high scale by many customers. I'm going to assume that even if said Medium blogger actually used Aerospike in production, it probably wasn't on the scale of 3Mtps reads and 1.5Mtps writes that AppNexus see for their Aerospike installation. Perhaps the proof of whether it's an appropriate Key-Value database for a production system is in its current use by real customers.
Let's address your specific question about whether to use Cassandra or Aerospike for a key-value use case. You probably want to start with high quality benchmarks comparing the two, but how do you determine if those are well done? Aerospike has published a manifesto about what high quality benchmarking of databases should look like.
When you run into a benchmark, read all the way down the post and check the object sizes, the number of objects, size of the data set, length of the test. If the vendor chose a tiny data set and ran their test for a few minutes it isn't a valid benchmark. There's nothing to be learned from it about how the database would perform at real, sustainable loads, over realistic data sizes, for extended periods of time.
In the spirit of the manifesto, Aerospike has published a detailed benchmark versus both Cassandra and ScyllaDB. Both show that Aerospike has consistently lower latencies with little variation, while the other databases have wild latency fluctuations. This is due to the architecture differences between the cache-first architecture of first generation NoSQL like Cassandra (also Couchebase, MongoDB, etc) and the hybrid-memory architecture design of Aerospike.
In a cache-first architecture, the database will first look to its in-memory caches for the keys and objects, and only go to disk when there's a cache-miss. The database then takes a big latency penalty for paging data from SSD into memory, and then operating on this memory. Such databases expect the majority of reads to come out of cache. Once the cache hit ratio drops into a realistic range (not their hoped for 80% - 95%) a cache-first database will display latency spikes as it goes to disk. As a consequence, a Cassandra cluster needs lots of RAM across many nodes.
In the case of Aerospike, the hybrid-memory architecture (HMA) holds the primary index (metadata about all the objects) in DRAM, and relies on optimizations around SSD performance to fetch the data directly from disk at low latency. There's a wide range of performance between different SSDs (see Aerospike on Intel Optane), so you would use data from the open-source ACT tool to understand what the sustainable read/write performance of the SSD is, while still achieving 95% of operations <= 1ms. HMA therefore requires very little memory per-node (64B per-object times the replication factor), resulting in smaller clusters. Data is served directly from SSD so you can expect consistently low latency for your operations, even at high scale.
If you're storing all your data fully in memory, take a look at What's New in Aerospike 3.12? and What's New in Aerospike 3.11?, as they include optimizations for such a use case. Specifically see sprigs and CPU pinning.

Temporary speed improvements

In our workflow, we have little ongoing work in the arangodb (~1% cpu use). For about 30 minutes of the day usage spikes and we need it to be more performant (helping do a 3s query to 1s).
Instead of moving up the instance box that it's hosted on, is there a way to get more out of arango temporarily during peak times? Would this be clustering or should we just look into temporarily boosting the instance that it's on.
Accumulating above suggestions plus adding some more that fit the generic nature of this question.
if possible split read/write workload, either in a timely fashion by holding back writes, or by switching to a new collection for the new writes.
make sure indices are properly set (use explain)
try whether query profiling can help you improve the performance

looking for an opensource in memory database with indexes [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
we are looking for an opensource in memory database which can support indexes.
The use case is that we have lot of items that are going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in application's memory. However with increasing data, we have to think about distributing/sharding the db.
We have looked at a few options
Redis cluster could be used, but it does not have the concept of
indexes or SQL like queries.
Apache Ignite is both in-memory, and distributed as well as provides
SQL queries. However, the problem is that ignite fires all
queries into all master nodes, so that the final result will be
slower than the slowest of those queries. It seems like a problem
because a non performing/slow node out of a number of nodes can
really slow down the application a lot. Further in ignite, reads are
done from the masters and slaves are not used, so that it is
difficult to scale the queries. Increasing the nodes will have
negative impact as the no of queries will increase and it will be
even slower.
Cassandra - The in-memory option in cassandra can be used, but it
seems that the max size of a table per node can be 1 GB. If
our table is more than 1 GB, we will have to resort to partitioning
which will inturn lead cassandra to make multiple queries(one per
node) and it is a problem(same as ignite). Not sure whether reads in
cassandra in-memory table can be scaled by increasing the number of
slaves.
We are open to other solutions but wondering whether the multi-query will be a problem everywhere(like hazelcast).
The ideal solution for our use case would be an in-memory database with indexes which could be read scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries and we are reluctant because one erring node could slow the whole system down.
Hazelcast supports indexes (sorted & unsorted) and what is important there is no Multi-Query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to a node that is a primaryReplica of the key passed to the constructor of the PartitionPredicate. So if you know where the data resides you can just query this node. So no need to fix or implement anything to support it, you can use it right away.
It's probably not reasonable to use it all the time. Depends on your use-case.
For complex queries that scan a lot of data but return small results it's better to use OBJECT inMemoryFormat. You should get excellent execution times and low latencies.
Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this as disadvantage. This is reality you should embrace and accept. It is necessary understand why it is slow and fix/upgrade it.
2) Ignite are able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently query is broadcasted to all data nodes. We are going to improve it. First, we will let users to specify partitions to execute the query against [2]. Second, we are going to improve our optimizer so that it will try to calculate target data nodes in advance to avoid broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least - persistent layer will be released in several months [5], meaning that Ignite will become distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html
I can give opinions on cassandra. Max size of your table per node is configurable and tunable so it depends on the amount of the memory that you are willing to pay. Partitioning is built in into cassandra so basically cassandra manages it for you. It's relatively simple to do paritioning. Basically first part of the primary key syntax is partitioning key and it determines on which node in the cluster the data lives.
But I also guess you are aware of this since you are mentioning multiple query per node. I guess there is no nice way around it.
Just one slight remark there is no master slaves in cassandra. Every node is equal. Basically client asks any node in the cluster, this node then becomes coordinator nodes and since it gets partitioning key it knows which node to ask the data for and it gives it then to the client.
Other than that I guess you read upon cassandra enough (from what I can see in your question)
Basically it comes down to the access pattern, if you know how you are going to access your data then it's the way to go. But other databases are also pretty decent.
Indexing with cassandra usually hides some potential performance problems. Usually people avoid it because in cassandra index has to be build for every record there is on whole cluster and it's done per node. This doesn't really scale. Basically you always have to do query first no matter how ypu put it with cassandra.
Plus the in memory seems to be part of the DSE cassandra. Not the open source or community one. You have to take this into account also.

Running two instances of MongoDB

I am working on a highly I/O Intensive application (A selection based on the availability of seats) using MERN Stack.
The app is expected to get 2000 concurrent users.
I want to know whether it's wise to use two instances of MongoDB, one on the RAM (in memory) and another on the Hard drive.
The RAM one to be used to store the available seats.
And the Hard drive one to backup the data after regular intervals.
But at the same time I know that if the server crashes my MongoDB data on the RAM is lost.
Could anyone guide me please?
I am using Socket IO instead of AJAX...
I don't think you need this. You can get a good server, with a good amount of RAM, and if you create your indexes correctly, everything should work fine.
Also Mongo 3 won't lock the entire database on each update, like Mongo 2 used to do.
I believe the best approach would be using something like Memcached in order to improve reads. Also, in order to improve database performance and have automated failover use sharding and replica sets.
Consider also that you would have headaches when your server restarted and you lose your data...
This seems unnecessary, because MongoDB already behaves exactly like that out-of-the-box.
The old engine (MMAPv1) was using memory-mapped files, which means that if you have as much RAM as you have data, it practically behaves like an in-memory database with automatic hard-drive backing.
The new engine (Wired Tiger) works a bit different in detail, but the same in general. It allows you to set a cache size (config key storage.wiredTiger.engineConfig.cacheSizeGB). When the cache size is as large enough, you again have an in-memory database with automatic hard-drive mirroring.
More about that in the storage FAQ.
What you are talking about is a scaling problem. You have two options when it comes to scaling: Add resources causing the bottleneck to your existing setup (more RAM and faster disks, usually) or expand your setup. You should first add resources, almost up to the point where adding resources does not give you an according bang for the buck.
At some point, this "scaling up" will not be feasible any more and you have to distribute the load amongst more nodes.
MongoDB comes with a feature for distributing load amongst (logical) nodes: sharding.
Basically, it works like this: multiple replica sets each form a logical node called a shard. Each shard in turn only holds a subset of your data. Instead of connecting to the shards directly, you acres your data via a mongos query router which is aware of which shard holds the data to answer the query and where to write new data.
By carefully selecting your shard key, your reads and writes should be evenly distributed between the shards.
Side note: putting production data on a standalone instance instead of a replica set crosses the border of negligence in my book. Given the prices of today's (rented) hardware, it has never been easier to eliminate a single point of failure than with a MongoDB replica set.

Using Cassandra for user registration [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I am using Cassandra for a social app I am working on. I really like all Cassandra has to offer. I want to know if it is okay to store username and password in Cassandra itself or should I use a second database (mongodb) for storing username and password. Will eventual consistency cause problems when a user resets password or changes email. I am using email as primary key to look up user data. I am writing the backend in JavaScript in node.js.
Let me explain what i am trying to do. I want to add a password reset and a lock out feature after lets say 5 wrong password tries. The problem is if the user tries to login before the third servers updates and what happens if the password is compared to the outdated data on the third server? Wouldn't the user get locked out. Is the best course of action to store the username and password in a separate database such as mongodb? Or is their another way to solve this issue.
You could store username/pwd in Cassandra - their basic data modeling page has a relevant example that may help your thinking: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling.
Eventual consistency could be an issue if you don't pay attention to how you read and write data (and your reads/writes happen near each other). It is possible to tune the cluster to behave like a 'normal' store (e.g. once I write the data, subsequent reads will get the last written value); however, the configuration below reduces some of the great things Cassandra brings to the table.
Data Consistency
You can configure a data consistency of ALL for your writes - meaning that you ensure that all nodes are up to date before your write completes. No eventual consistency issues on the writes.
A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition key.
However, this write takes longer and requires that enough nodes are up and available for you to get the guarantees you are asking for.
You can also specify read consistency of ALL which means:
Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
In this case, if you are writing data to three nodes and one of them is unable to respond your read will fail because Cassandra cannot provide the requested guarantee of consistency.
I should point out that these are the most strict configurations you can use and there are others that may work for your situation. The point I am trying to make is: as you ask Cassandra to behave like a more traditional store, you trade off some of the aspects of the cluster that are really appealing.
You need to consider the anticipated characteristics of your system (read volume, write volume, importance of those two being in sync etc) when making this decision. IMOP while you probably could use Cassandra for this type of data, unless you are dealing with a huge volume of users, it wouldn't be my first choice.

Resources