How does yugabytedb achieve high data density? [closed] - yugabytedb

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
It is understood that yugabytedb creates n key/value records per SQL row.
Still, how does it manage to achieve high data density?

We use DocDB internally which uses a heavily modified RocksDB.
A summary from the doc links at the end of comment:
Each table gets split into several tablets, each tablet being a RocksDB instance.
RocksDB instances use size-tiered compaction which has low write IO amplification and higher space amplification (needs 50% free disk space). This is mitigated by having several small tablets. We get low write IO with no space amplification concerns.
Global memstore limits. A large number of tablets adds no in-memory overhead because all memstores share one pool.
Global block cache. It handles caching for all sstables, which reduces overhead compared with keeping a cache per RocksDB instance.
Globally throttled compactions and small/big compaction queues help prevent compaction storms from overwhelming the server.
Striping tablet load uniformly across data disks: Each tablet can reside in a different disk (JBOD). This will loadbalance sstables & WAL IO between disks in the machine.
Efficient C++ implementation. No stop-the-world GC helps with keeping latencies low & consistent.
When the server is at full usage we stop accepting new queries. This keeps the server from crashing or being overwhelmed.
On-disk block compression (low read/write IO).
Uncompressed in-memory block cache: low CPU overhead on each read & able to serve more hot queries.
Disabling the RocksDB WAL and keeping only the Raft WAL. This reduces read/write IO.
https://docs.yugabyte.com/latest/architecture/docdb/performance/
https://docs.yugabyte.com/latest/architecture/concepts/yb-tserver/
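To make the space amplification point concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case where a compaction rewrites a whole RocksDB instance and temporarily needs free space equal to that instance's size. The function name and all numbers are hypothetical, not taken from the YugabyteDB docs.

```python
def compaction_headroom_gb(total_data_gb: float,
                           num_tablets: int,
                           concurrent_compactions: int) -> float:
    """Worst-case temporary disk space needed while compacting.

    Assumption: a full compaction rewrites one whole tablet, so it needs
    free space equal to that tablet's size until the old files are dropped.
    """
    tablet_size_gb = total_data_gb / num_tablets
    compacting = min(concurrent_compactions, num_tablets)
    return tablet_size_gb * compacting


# One big RocksDB instance holding 1000 GB: headroom equals the data size.
single = compaction_headroom_gb(1000, num_tablets=1, concurrent_compactions=1)

# The same data in 100 tablets with 4 compactions running at once.
many = compaction_headroom_gb(1000, num_tablets=100, concurrent_compactions=4)

print(single, many)  # 1000.0 40.0
```

Under these assumptions the free space to reserve shrinks from the full data size to a few tablets' worth, which is the mitigation the summary describes.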


Is RethinkDB a good fit for a generic Real-time aggregation platform? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I need your help to verify if RethinkDB fits my use case.
Use case
My team is building a generic Real-time aggregation platform which needs to:
join data from a lot of Kafka topics
Joins need to be done on raw data
Topics have the same key
Data in topics is sometimes a “snapshot” (updatable) and sometimes an “event” (non-updatable)
The destination of the joined data will be some analytical OLAP DB. Clickhouse, Druid, etc. Depending on the case. These systems work with “deltas” (SCDs). Because of “snapshots”, I need stateful processing.
Updates for snapshots can come up to 7 days later
Topics receive around 20k msg/s with peaks up to 200k msg/s
Data in topics is json from 100 Bytes to 5kB
Data in topics can have duplicates
Duplicates are deduplicated with “version” json field which is part of every topic. Data should be processed only if new_version > old_version. Or if old_version didn't exist.
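The version rule above can be sketched as follows. This is a minimal illustration only: it assumes “version” is a comparable number and uses an in-memory dict as a stand-in for whatever state store is actually used. `should_process` and `VersionFilter` are hypothetical names.

```python
from typing import Dict, Optional


def should_process(old_version: Optional[int], new_version: int) -> bool:
    """Process only if no version was seen yet or the new one is newer."""
    return old_version is None or new_version > old_version


class VersionFilter:
    """Tracks the last processed version per Kafka key (in memory)."""

    def __init__(self) -> None:
        self._seen: Dict[str, int] = {}

    def accept(self, key: str, message: dict) -> bool:
        new_version = message["version"]
        if should_process(self._seen.get(key), new_version):
            self._seen[key] = new_version
            return True
        return False  # duplicate or stale message, skip it


f = VersionFilter()
print(f.accept("k1", {"version": 1}))  # True: first time this key is seen
print(f.accept("k1", {"version": 1}))  # False: duplicate
print(f.accept("k1", {"version": 3}))  # True: newer version
print(f.accept("k1", {"version": 2}))  # False: late, older version
```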
I already have a POC with Cassandra with five stages:
Cassandra Inserter - consumes from all Kafka topics and does insert-only writes for all topics into the same Cassandra table. Sharding is done on a column that holds the key shared by all the Kafka topics, so all the messages with the same key end up in the same shard.
For every Cassandra insert an InsertEvent is produced to Kafka
Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key. It gets all raw data, then deduplicates and creates deltas. The state is saved in another Cassandra cluster by saving all the processed “versions”. The next time a new InsertEvent comes, we use the saved state “version” to get only two events, previous and current, so we can create a DeltaEvent
DeltaEvent is produced to Kafka
ClickHouse / Druid ingest the data
So it's basically a 50/50 insert/read workload without updates to Cassandra.
With 14 Cassandra data nodes and 8 state nodes it works OK up to 20k InsertEvent/s. With 25k InsertEvent/s the system begins to lag.
Nodes have 16GB Ram and disks are network storage backed by SSD (not ideal, I know, but can't change it now). Network 10 Gbit.
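As a rough sanity check of the message rates and sizes against the 10 Gbit network, here is the raw ingest arithmetic. It assumes 5kB means 5000 bytes and ignores replication, Kafka framing and read traffic, so real numbers would be higher.

```python
def ingest_gbit_per_s(msgs_per_s: int, msg_bytes: int) -> float:
    """Raw payload bandwidth in Gbit/s (no replication, no protocol overhead)."""
    return msgs_per_s * msg_bytes * 8 / 1e9


low = ingest_gbit_per_s(20_000, 100)       # typical rate, smallest messages
high = ingest_gbit_per_s(200_000, 5_000)   # peak rate, largest messages

print(low, high)
```

With these assumptions the worst case (peak rate with the largest messages) lands at about 8 Gbit/s of payload alone, which is already close to the 10 Gbit link before any replication or read traffic is counted.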
RethinkDB idea
I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. Primary key / sharding key would be the Kafka key and all Kafka data from all topics with the same key would be joined/upserted in a single document.
The workload would be probably 10/90 insert/update. I would use squash: true, to avoid excessive reads and reduce the amount of DeltaEvents.
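To illustrate what squashing is expected to buy here, this is a pure-Python approximation of the idea: several changes to the same document inside one batch collapse into a single change. It is not the RethinkDB driver (there the option is passed as `changes(squash=True)` in the Python API), and the `Change` tuple layout is an assumption made for the sketch.

```python
from typing import Any, Dict, Iterable, List, Tuple

Change = Tuple[str, Any, Any]  # (primary key, old_val, new_val)


def squash(batch: Iterable[Change]) -> List[Change]:
    """Collapse multiple changes per key into one.

    Keeps the old value of the first change and the new value of the
    last change, so the consumer still ends up fully up to date.
    """
    merged: Dict[str, Tuple[Any, Any]] = {}
    for key, old_val, new_val in batch:
        if key in merged:
            first_old, _ = merged[key]
            merged[key] = (first_old, new_val)
        else:
            merged[key] = (old_val, new_val)
    return [(key, old, new) for key, (old, new) in merged.items()]


batch = [("a", None, 1), ("b", None, 10), ("a", 1, 2), ("a", 2, 3)]
print(squash(batch))  # [('a', None, 3), ('b', None, 10)]
```

Four changes become two DeltaEvents in this example. How much is saved in practice depends on how often the same key is updated within one batch.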
Do you think this is a good use case for RethinkDB?
Will it scale up to 200k msg/s, which would be 20k inserts/s, 180k updates/s and around 150k reads/s via changefeeds?
I will need to delete data older than 7 days. How will that affect the insert/update/query workload?
Do you have a proposal for a system which would be a better fit for this use case?
Thanks a lot,
Davor
PS: if you prefer reading a document, here it is: RethinkDB use case question.
IMHO, RethinkDB is a good fit for your use case.
From RethinkDB docs
...RethinkDB scales to perform 1.3 million individual reads per second. ...RethinkDB performs well above 100 thousand operations per second in a mixed 50:50 read/write workload - while at the full level of durability and data integrity guarantees. ...performed all benchmarks across a range of cluster sizes, scaling up from one to 16 nodes.
Folks at RethinkDB have tested a similar scenario using workloads from the YCSB benchmark suite and reported their results.
We found that in a mixed read/write workload, RethinkDB with two servers was able to perform nearly 16K queries per second (QPS) and scaled to almost 120K QPS while in a 16-node cluster. Under a read only workload and synchronous read settings, RethinkDB was able to scale from about 150K QPS on a single node up to over 550K QPS on 16 nodes. Under the same workload, in an asynchronous “outdated read” setting, RethinkDB went from 150K QPS on one server to 1.3M in a 16-node cluster.
Selecting workloads and hardware
...Out of the YCSB workload options, we chose to run workload A which comprises 50% reads and 50% update operations, and workload C which performs strictly read operations. All documents stored by the YCSB tests contain 10 fields with randomized 100 byte strings as values, with each document totaling about 1 KB in size.
Given the ease of scaling RethinkDB clusters across multiple instances, we deemed it necessary to observe performance when moving from a single RethinkDB instance to a larger cluster. We tested all of our workloads on a single instance of RethinkDB up to a 16-node cluster in varying increments of cluster size.
Additionally, I suggest reading through limitations on RethinkDB. I've copied some here.
There is a hard limit of 64 shards.
While there is no hard limit on the size of a single document, there is a recommended limit of 16MB for memory performance reasons.
The maximum size of a JSON query is 64M.
Primary keys are limited to 127 characters.
Secondary indexes do not store objects or null values.
Primary key strings may not include the null codepoint (U+0000).
By default, arrays on the RethinkDB server have a size limit of 100,000 elements. This can be changed on a per-query basis with the arrayLimit (or array_limit) option to run.
RethinkDB does not support Unicode collations, and does not normalize for identical characters with multiple codepoints (i.e, \u0065\u0301 and \u00e9 both represent the character “é” but RethinkDB treats them, and sorts them as, distinct characters).
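Regarding the last limitation, one possible workaround (a sketch, not something the RethinkDB docs prescribe) is to normalize strings on the client before writing them, so both spellings of “é” are stored identically. `normalize_key` is a hypothetical helper name; `unicodedata` is in the Python standard library.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Return the NFC (precomposed) form of a string."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"   # 'e' followed by a combining acute accent
precomposed = "\u00e9"   # single codepoint 'é'

print(decomposed == precomposed)                                # False
print(normalize_key(decomposed) == normalize_key(precomposed))  # True
```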
Since yours is real-time system, RethinkDB memory requirements and crash recovery are also worth a read.
Furthermore, no benchmark of delete performance is available.

Aerospike over in-memory Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
I'm a bit of a noob with NoSQL & cannot decide between Aerospike and a completely in-memory Cassandra.
Use Case:
To be used for multiple services in our University (from social platform to internal financial analytics to network logging to real-time messaging). Our daily active users are also constant (~5000). So my primary requirement is not to get 1M+ TPS but to reduce latency and maintain consistency, serving the user data as fast as possible. The DB would be running on 3 bare metal servers with 32 vcores, 128GB RAM and 256GB SSD each, connected over 10Gbit. The data won't exceed RAM as most of the data will be archived (to another Elasticsearch server) every 6 months.
Also, I don't mind taking the challenge and doing a bit of over-engineering & it's fine if the cluster is hard to set up, but it should require little or no maintenance for years.
So looking over in-memory DBs, Aerospike seemed a great choice. I was very excited to go blazingly fast, but then I read "Aerospike total garbage?" & "We use Aerospike heavily. It works just fine." Now this got me thinking: is this the best fit for me?
Or should I go for complete in-memory Cassandra which is not optimised for complete in-memory table & still is less performant than Aerospike but has a better data model fit for me, does not have consistency issues and is tried & tested.( I am intrigued by ScyllaDB but it doesn't have in-memory tables)
I would like to have answers from people with production experience with Aerospike & Cassandra. Also please tell me if I am completely wrong.
My first point is that this isn't a valid Stackoverflow question. When you click on Ask Question the How to Ask block states Is your question about programming?
The Medium article is a poorly written opinion from a faceless user, without data to back up the claims. Yes, Aerospike has bugs, as do all databases. GCE itself has bugs that can affect a distributed database such as Aerospike. I haven't seen any issue in the aerospike/aerospike-server repo on GitHub talking about this user's problems on GCE. Usually people who use a software product in production will report a bug that affects them severely. The lack of a bug report is a "bad smell" - is it FUD?
Aerospike is in fact used for high performance at high scale by many customers. I'm going to assume that even if said Medium blogger actually used Aerospike in production, it probably wasn't on the scale of 3Mtps reads and 1.5Mtps writes that AppNexus see for their Aerospike installation. Perhaps the proof of whether it's an appropriate Key-Value database for a production system is in its current use by real customers.
Let's address your specific question about whether to use Cassandra or Aerospike for a key-value use case. You probably want to start with high quality benchmarks comparing the two, but how do you determine if those are well done? Aerospike has published a manifesto about what high quality benchmarking of databases should look like.
When you run into a benchmark, read all the way down the post and check the object sizes, the number of objects, size of the data set, length of the test. If the vendor chose a tiny data set and ran their test for a few minutes it isn't a valid benchmark. There's nothing to be learned from it about how the database would perform at real, sustainable loads, over realistic data sizes, for extended periods of time.
In the spirit of the manifesto, Aerospike has published a detailed benchmark versus both Cassandra and ScyllaDB. Both show that Aerospike has consistently lower latencies with little variation, while the other databases have wild latency fluctuations. This is due to the architecture differences between the cache-first architecture of first generation NoSQL like Cassandra (also Couchbase, MongoDB, etc) and the hybrid-memory architecture design of Aerospike.
In a cache-first architecture, the database will first look to its in-memory caches for the keys and objects, and only go to disk when there's a cache-miss. The database then takes a big latency penalty for paging data from SSD into memory, and then operating on this memory. Such databases expect the majority of reads to come out of cache. Once the cache hit ratio drops into a realistic range (not their hoped for 80% - 95%) a cache-first database will display latency spikes as it goes to disk. As a consequence, a Cassandra cluster needs lots of RAM across many nodes.
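A simple way to see the effect of the cache hit ratio is the weighted average of cache and disk latency. The latencies below (0.1 ms from cache, 2 ms on a miss) are made-up illustrative values, not measurements of any database, and an average hides the tail spikes mentioned above.

```python
def expected_read_latency_ms(hit_ratio: float,
                             cache_ms: float,
                             disk_ms: float) -> float:
    """Average read latency for a cache-first store."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * disk_ms


for ratio in (0.95, 0.80, 0.50):
    latency = expected_read_latency_ms(ratio, cache_ms=0.1, disk_ms=2.0)
    print(f"hit ratio {ratio:.2f}: {latency:.3f} ms on average")
```

With these assumed numbers the average goes from roughly 0.2 ms at a 95% hit ratio to roughly 1 ms at 50%, so the result is dominated by the miss path once the hit ratio drops.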
In the case of Aerospike, the hybrid-memory architecture (HMA) holds the primary index (metadata about all the objects) in DRAM, and relies on optimizations around SSD performance to fetch the data directly from disk at low latency. There's a wide range of performance between different SSDs (see Aerospike on Intel Optane), so you would use data from the open-source ACT tool to understand what the sustainable read/write performance of the SSD is, while still achieving 95% of operations <= 1ms. HMA therefore requires very little memory per-node (64B per-object times the replication factor), resulting in smaller clusters. Data is served directly from SSD so you can expect consistently low latency for your operations, even at high scale.
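The memory figure above (64B per object times the replication factor) turns into a simple sizing estimate. The object count, replication factor and node count here are hypothetical, and the sketch assumes the index is spread evenly across nodes.

```python
def primary_index_bytes(num_objects: int,
                        replication_factor: int,
                        entry_bytes: int = 64) -> int:
    """Cluster-wide DRAM needed for the primary index."""
    return num_objects * entry_bytes * replication_factor


def per_node_gib(total_bytes: int, num_nodes: int) -> float:
    """Even split of the index across nodes, in GiB."""
    return total_bytes / num_nodes / 2**30


total = primary_index_bytes(num_objects=1_000_000_000, replication_factor=2)
print(total)                            # 128000000000 bytes cluster-wide
print(round(per_node_gib(total, 3), 1))
```

For one billion objects at replication factor 2 that is 128 GB of index across the cluster, regardless of how large the objects themselves are.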
If you're storing all your data fully in memory, take a look at What's New in Aerospike 3.12? and What's New in Aerospike 3.11?, as they include optimizations for such a use case. Specifically see sprigs and CPU pinning.

looking for an opensource in memory database with indexes [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 5 years ago.
We are looking for an open-source in-memory database which can support indexes.
The use case is that we have a lot of items, and the number is going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in the application's memory. However, with increasing data, we have to think about distributing/sharding the db.
We have looked at a few options
Redis cluster could be used, but it does not have the concept of indexes or SQL like queries.
Apache Ignite is both in-memory, and distributed as well as provides SQL queries. However, the problem is that ignite fires all queries into all master nodes, so that the final result will be slower than the slowest of those queries. It seems like a problem because a non performing/slow node out of a number of nodes can really slow down the application a lot. Further in ignite, reads are done from the masters and slaves are not used, so that it is difficult to scale the queries. Increasing the nodes will have negative impact as the no of queries will increase and it will be even slower.
Cassandra - The in-memory option in cassandra can be used, but it seems that the max size of a table per node can be 1 GB. If our table is more than 1 GB, we will have to resort to partitioning which will inturn lead cassandra to make multiple queries (one per node) and it is a problem (same as ignite). Not sure whether reads in cassandra in-memory table can be scaled by increasing the number of slaves.
We are open to other solutions but wondering whether the multi-query will be a problem everywhere(like hazelcast).
The ideal solution for our use case would be an in-memory database with indexes which could be read scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries and we are reluctant because one erring node could slow the whole system down.
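The multi-query concern can be put in numbers: when a query is broadcast to every node, it completes only when the slowest node has answered. The latencies below are invented for illustration, and the model ignores network and merge cost.

```python
from typing import Sequence


def scatter_gather_latency_ms(node_latencies_ms: Sequence[float]) -> float:
    """A broadcast query waits for every node, so the slowest one wins."""
    return max(node_latencies_ms)


def single_node_latency_ms(node_latencies_ms: Sequence[float],
                           owner_index: int) -> float:
    """A query routed to one owning node only pays that node's latency."""
    return node_latencies_ms[owner_index]


latencies = [2.0, 3.0, 2.5, 40.0]  # the last node is misbehaving

print(scatter_gather_latency_ms(latencies))   # 40.0
print(single_node_latency_ms(latencies, 0))   # 2.0
```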
Hazelcast supports indexes (sorted & unsorted) and, importantly, there is no Multi-Query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to a node that is a primaryReplica of the key passed to the constructor of the PartitionPredicate. So if you know where the data resides you can just query this node. So no need to fix or implement anything to support it, you can use it right away.
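The idea behind restricting a query to one partition can be sketched without Hazelcast itself: hash the key to a partition and look up the member that owns it. This is only a conceptual model. The CRC32 hash and the round-robin ownership table are stand-ins chosen for the sketch, not Hazelcast's real hashing or partition assignment; 271 is Hazelcast's default partition count.

```python
import zlib
from typing import Sequence

DEFAULT_PARTITION_COUNT = 271  # Hazelcast's default number of partitions


def partition_for(key: str, partition_count: int = DEFAULT_PARTITION_COUNT) -> int:
    """Map a key to a partition id (stand-in hash, not Hazelcast's)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def owner_for(key: str, members: Sequence[str],
              partition_count: int = DEFAULT_PARTITION_COUNT) -> str:
    """Pick the member owning the key's partition (round-robin assumption)."""
    return members[partition_for(key, partition_count) % len(members)]


members = ["node-a", "node-b", "node-c"]
print(owner_for("customer-42", members))  # the only node this query must visit
```

A query carrying the key can then be sent to that one member instead of being broadcast.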
It's probably not reasonable to use it all the time. Depends on your use-case.
For complex queries that scan a lot of data but return small results it's better to use OBJECT inMemoryFormat. You should get excellent execution times and low latencies.
Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this a disadvantage. This is a reality you should embrace and accept. It is necessary to understand why a node is slow and fix/upgrade it.
2) Ignite is able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using a REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently a query is broadcast to all data nodes. We are going to improve it. First, we will let users specify partitions to execute the query against [2]. Second, we are going to improve our optimizer so that it will try to calculate target data nodes in advance to avoid broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least - the persistent layer will be released in several months [5], meaning that Ignite will become a distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html
I can give opinions on cassandra. Max size of your table per node is configurable and tunable, so it depends on the amount of memory that you are willing to pay for. Partitioning is built into cassandra, so basically cassandra manages it for you. It's relatively simple to do partitioning. Basically the first part of the primary key is the partitioning key, and it determines on which node in the cluster the data lives.
But I also guess you are aware of this since you are mentioning multiple query per node. I guess there is no nice way around it.
Just one slight remark: there are no masters and slaves in cassandra. Every node is equal. Basically the client asks any node in the cluster, this node then becomes the coordinator node, and since it gets the partitioning key it knows which node to ask for the data and then gives it to the client.
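How a coordinator finds the owning node from the partitioning key can be sketched as a token ring lookup. This is a simplified model: it uses MD5 truncated to 64 bits as a stand-in for Cassandra's actual partitioner, one token per node, and no replication. Node names and token values are hypothetical.

```python
import bisect
import hashlib
from typing import Dict


def token(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (stand-in partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Each node owns the token range ending at its own token."""

    def __init__(self, node_tokens: Dict[str, int]) -> None:
        self._ring = sorted((t, node) for node, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def owner_of_token(self, t: int) -> str:
        # First node whose token is >= t; wrap to the first node at the end.
        i = bisect.bisect_left(self._tokens, t)
        return self._ring[i % len(self._ring)][1]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token(partition_key))


ring = Ring({"node-a": 2**62, "node-b": 2**63, "node-c": 3 * 2**62})
print(ring.owner("user:1234"))  # any coordinator computes the same owner
```

Because every node can do this lookup, any node can act as coordinator, which is why there is no master.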
Other than that I guess you read upon cassandra enough (from what I can see in your question)
Basically it comes down to the access pattern, if you know how you are going to access your data then it's the way to go. But other databases are also pretty decent.
Indexing with cassandra usually hides some potential performance problems. Usually people avoid it because in cassandra the index has to be built for every record on the whole cluster, and it's done per node. This doesn't really scale. Basically you always have to design the query first, no matter how you put it with cassandra.
Plus the in-memory option seems to be part of DSE cassandra, not the open source or community one. You have to take this into account also.

Optimizing MongoDB for reads [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'm using MongoDB as a read-only document source, used for computing statistics. Each document has no subdocuments, but the database has approximately 900k documents and will grow by about 1k documents each day, added at a time when the database is idle.
So, I'd like to understand the following things:
I've read that MongoDB works best when the entire collection is stored in RAM. Assuming my database is ~400MB and our server can easily cram the whole thing into RAM, is there a way I can tell MongoDB to pre-load my entire collection into RAM?
I've also read that there are cases where creating replica sets will help with the read performance of the database. Is my scenario one of the cases where this will help?
I'm threading my statistical calculations, but notice that the amount of time to complete the queries I run against mongoDB when doing these calculations triples when I thread them as opposed to running them synchronously. Is there anything I can do to improve the performance of the DB when I'm making requests against the same collection simultaneously?
No, MongoDB DOES NOT WORK BEST when the collection is in RAM. I have no idea who told you that, but it is a common misconception about how MongoDB works.
MongoDB works best when it can not only fit your working set into RAM ( What does it mean to fit "working set" into RAM for MongoDB? ) but also load it into RAM quickly. One thing that can help the speed of paging in your working set is the size of your documents.
This is one reason why MongoDB documents are limited to 16MB: it has been found that larger sizes start to have a seriously detrimental performance impact. Basically you spend too much time loading your data from the disk. This is also one reason for de-normalisation by logically splitting tables in SQL techs: to make them faster to load.
This means you may have to optimise both the size of the value and the size of the field name to match performance needs for your reads. You will of course also have to match hardware.
Replica sets are not actually designed to help with read performance, they are designed to give your data high availability by making automated failover. The topic you read suggests getting stale reads from secondaries. This, as has been proven (edit: since proven is a strong word and this is scenario based I'm going to say "found") recently, can actually be less performant than using PrimaryPreferred read preference.
As for improving performance we would need stats from you on page faults, IO bottlenecks and general mongostat and top.
About Point 1:
You can use the touch command to persuade the database to load a collection into memory. But keep in mind that this isn't permanent. When you don't access the cached documents soon, they will get uncached in favor of more frequently-used documents.
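For reference, the touch command is an ordinary database command document. The sketch below only builds that document. The collection name `stats` is made up, and actually sending it (for example with pymongo's `db.command(...)`) assumes a driver and a server version and storage engine that still support touch, which is not the case for every MongoDB release.

```python
def build_touch_command(collection: str,
                        data: bool = True,
                        index: bool = True) -> dict:
    """Build the document for MongoDB's touch command.

    The command name must be the first key; Python dicts keep
    insertion order, so the literal below preserves it.
    """
    return {"touch": collection, "data": data, "index": index}


cmd = build_touch_command("stats")
print(cmd)  # {'touch': 'stats', 'data': True, 'index': True}

# With pymongo installed and a supporting server (assumption):
#   db.command(cmd)
```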
About Point 2 and 3:
Replica-sets are a good way to improve the performance of parallel read operations. Each server of a replica-set mirrors the whole data and can respond to any query on its own without having to contact the other servers. That means when you double the number of servers in your replica-set, you also double the performance of simultaneous queries.
Keep in mind that the read preferences you set on your connection might prevent it from using more than one server.
Alternatively you can build a sharded cluster, but this is technically a lot more complex than a replica-set and won't improve read-performance much when your queries don't match the shard-key of the collection or when you selected your shard-key in a way that the requests aren't evenly distributed between the shards.

Usage of Cassandra 1.2 Vnodes in production [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
One year has passed since vnodes were released with Cassandra 1.2. I have read a couple of Datastax articles describing this feature. They said the feature is awesome, but I want to ask those people who use it in production:
Is it really stable and ready for production?
What about repair speed and disk usage overhead while repair is running? This is very important for us.
What about rebalancing speed?
What about Hadoop stability/performance while using it with Cassandra vnodes enabled?
When should I avoid using vnodes?
We have 1.5 TB per node with RF=3. When I turn vnodes on, will all the data be redistributed? My concern is the network.
I can't answer all of your questions, but here's what I can help with.
Repair is only very slightly affected by vnodes. Assuming you have 256 tokens per node, there are 256 times as many repair tasks with each one being 256 times smaller. For anything other than a very small amount of data, the extra overhead in creating the extra tasks is negligible. So I don't think you will notice any difference with repair with 1.5 TB of data.
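The repair arithmetic above, worked out for the numbers in the question (1.5 TB per node, 256 tokens per node). It assumes data is spread evenly over the token ranges and takes 1.5 TB as 1500 GB.

```python
from typing import Tuple


def repair_task_profile(data_per_node_gb: float,
                        tokens_per_node: int) -> Tuple[int, float]:
    """Return (number of repair tasks, data covered per task in GB)."""
    return tokens_per_node, data_per_node_gb / tokens_per_node


print(repair_task_profile(1500, 1))    # (1, 1500.0) without vnodes
print(repair_task_profile(1500, 256))  # (256, 5.859375) with vnodes
```

The total amount of data repaired is the same in both cases. Only the task count and the task size change, which is why the extra overhead is just the cost of creating more tasks.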
You don't need to rebalance with vnodes. When you add and remove nodes the cluster remains balanced.
Upgrading to vnodes is the biggest challenge. Practically all data needs to be redistributed. This can be done with shuffle (which in practice doesn't work very well so is not recommended), decommissioning and bootstrapping each node (which leaves one node temporarily storing a copy of all your data) or by duplicating your hardware and creating a new virtual data center and then decommissioning the old one.
