Is RethinkDB a good fit for a generic Real-time aggregation platform? [closed]

Is RethinkDB a good fit for a generic Real-time aggregation platform? [closed] - cassandra

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I need your help to verify if RethinkDB fits my use case.
Use case
My team is building a generic Real-time aggregation platform which needs to:
join data from a lot of Kafka topics
Joins need to be done on raw data
Topics have the same key
Data in topics is sometimes a “snapshot” (updatable) and sometimes en “event” (non-updatable)
The destination of the joined data will be some analytical OLAP DB. Clickhouse, Druid, etc. Depending on the case. These systems work with “deltas” (SCDs). Because of “snapshots”, I need stateful processing.
Updates for snapshots can come up to 7 days later
Topics receive around 20k msg/s with peaks up to 200k msg/s
Data in topics is json from 100 Bytes to 5kB
Data in topics can have duplicates
Duplicates are deduplicated with “version” json field which is part of every topic. Data should be processed only if new_version > old_version. Or if old_version didn't exist.
I already have a POC with Cassandra with five stages:
Cassandra Inserter - consumes from.all Kafka topics. Doing insert only for all topics in the same Cassandra table. Sharding is done on column which has the key as all the Kafka topics. So all the messages with the same key end-up in the same shard.
For every Cassandra insert an InsertEvent is produced to Kafka
Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key. Gets all raw data and then deduplicates and creates deltas. The state is saved in another Cassandra cluster. By saving all the processed “versions”. Next time a new InsertEvent comes, we use the saved state “version” to get only two events: previous and current so we can create a DeltaEvent
DeltaEvent is produced to Kafka
ClickHouse / Druid ingest the data
So it's basically a 50/50 insert/read workload without updates to Cassandra.
With 14 Cassandra data nodes and 8 state nodes nodes it works OK up to 20k InsertEvent/s. With 25k InsertEvent/s the system begins to lag.
Nodes have 16GB Ram and disks are network storage backed by SSD (not ideal, I know, but can't change it now). Network 10 Gbit.
RethinkDB idea
I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. Primary key / sharding key would be the Kafka key and all Kafka data from all topics with the same key would be joined/upserted in a single document.
The workload would be probably 10/90 insert/update. I would use squash: true, to avoid excessive reads and reduce the amount of DeltaEvents.
Do you think this is a good use case for RethinkDB?
Will it scale up to 200k msg/s which would be 20k inserts/s, 180k updates/s and around 150 k/reads via changefeeds?
I will need to delete data older than 7 days, how it will affect the insert/update/query workload?
do you have a proposal for a system which would be a better fit for this use case?
Thanks a lot,
Davor
PS: if you prefer reading a document, here it is: RethinkDB use case question.

IMHO, RehinkDB is good fit in your use case.
From RethinkDB docs
...RethinkDB scales to perform 1.3 million individual reads per second. ...RethinkDB performs well above 100 thousand operations per second in a mixed 50:50 read/write workload - while at the full level of durability and data integrity guarantees. ...performed all benchmarks across a range of cluster sizes, scaling up from one to 16 nodes.
Folks at RethinkDB have tested similar scenario using workloads from the YCSB benchmark suite and reported their results.
We found that in a mixed read/write workload, RethinkDB with two servers was able to perform nearly 16K queries per second (QPS) and scaled to almost 120K QPS while in a 16-node cluster. Under a read only workload and synchronous read settings, RethinkDB was able to scale from about 150K QPS on a single node up to over 550K QPS on 16 nodes. Under the same workload, in an asynchronous “outdated read” setting, RethinkDB went from 150K QPS on one server to 1.3M in a 16-node cluster.
Selecting workloads and hardware
...Out of the YCSB workload options, we chose to run workload A which comprises 50% reads and 50% update operations, and workload C which performs strictly read operations. All documents stored by the YCSB tests contain 10 fields with randomized 100 byte strings as values, with each document totaling about 1 KB in size.
Given the ease of scaling RethinkDB clusters across multiple instances, we deemed it necessary to observe performance when moving from a single RethinkDB instance to a larger cluster. We tested all of our workloads on a single instance of RethinkDB up to a 16-node cluster in varying increments of cluster size.
Additionally, I suggest reading through limitations on RethinkDB. I've copied some here.
There is a hard limit of 64 shards.
While there is no hard limit on the size of a single document, there is a recommended limit of 16MB for memory performance reasons.
The maximum size of a JSON query is 64M.
Primary keys are limited to 127 characters.
Secondary indexes do not store objects or null values.
Primary key strings may not include the null codepoint (U+0000).
By default, arrays on the RethinkDB server have a size limit of 100,000 elements. This can be changed on a per-query basis with the arrayLimit (or array_limit) option to run.
RethinkDB does not support Unicode collations, and does not normalize for identical characters with multiple codepoints (i.e, \u0065\u0301 and \u00e9 both represent the character “é” but RethinkDB treats them, and sorts them as, distinct characters).
Since yours is real-time system, RethinkDB memory requirements and crash recovery are also worth a read.
Furthermore, delete performance benchmark is missing.

Related

cassandra write throughput and scalability

This may sound like a dumb question but still I wanted someone/expert to answer/confirm this.
Lets say I have a 3 node cassandra cluster. Lets say I have one database and just one table. For this single table lets say I get a throughput of 1K writes/second with 3 node cassandra. If tomorrow my write load on this table increases/scales to 10K or 20K, will I be able to handle this write load by increasing the size of cluster by say 10x or 20x?
My understanding of cassandra says it is possible (as cassandra is both read and write scalable) but would want an expert to confirm.

Yes, Cassandra has Linear Scalability.
The scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Source : https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e

Yes - but only if your data is properly modeled - your data especially needs to be distributed evenly among your partition keys (since they map to specific replica nodes) to avoid hot spots. Given that, yes cassandra will scale horizontally well.
A "table" in cassandra is distributed among all nodes in your cluster. Each node is responsible for a range of tokens which are hashes of the partition key portion of your primary key.
Now, if you double your node count for example - the existing token ranges are split in half and distributed while bootstrapping the new nodes. So each node will only handle half of your inital requests. If you double your requests afterwards, each node will have roughly the same load as before.
For read intensive requests - choosing a higher replication factor helps when you can live with stale data for a while (e.g. read and write at a low consistency level).
There are good tutorials from DataStax available here https://academy.datastax.com/

Datastax states that:
What are the beneﬁts of Apache Cassandra?
Massively scalable ring architecture: Based on the best of Amazon Dynamo and Google BigTable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability.
Linear scale performance: Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
So the answer is YES, it is possible. It may take some time to adding a new node and redistribute tokens. But it will scale as you change the number of nodes.
If you need more info to understand how it will scale , check this links below:
Benchmarking Cassandra Scalability on AWS
Adding nodes to Cassandra
Adding, replacing, moving and removing nodes

Yes, it is so, but with the single remark. You should consider replication factor (RF) and consistency level (CL) as they affect the scaling behaviour also.
For example, if you initially have the 10 nodes with RF=3, and you increase the nodes count up to 20 with the same RF=3, you'll get the linear increase in write throughput.
But if you want to increase the read throughput, you need to increase RF. And with the increased RF you had to decrease write consistency level to improve write throughput.
To summarize, you could not increase both read and write throughput in a linear way with the same RF and CL params.

looking for an opensource in memory database with indexes [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
we are looking for an opensource in memory database which can support indexes.
The use case is that we have lot of items that are going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in application's memory. However with increasing data, we have to think about distributing/sharding the db.
We have looked at a few options
Redis cluster could be used, but it does not have the concept of
indexes or SQL like queries.
Apache Ignite is both in-memory, and distributed as well as provides
SQL queries. However, the problem is that ignite fires all
queries into all master nodes, so that the final result will be
slower than the slowest of those queries. It seems like a problem
because a non performing/slow node out of a number of nodes can
really slow down the application a lot. Further in ignite, reads are
done from the masters and slaves are not used, so that it is
difficult to scale the queries. Increasing the nodes will have
negative impact as the no of queries will increase and it will be
even slower.
Cassandra - The in-memory option in cassandra can be used, but it
seems that the max size of a table per node can be 1 GB. If
our table is more than 1 GB, we will have to resort to partitioning
which will inturn lead cassandra to make multiple queries(one per
node) and it is a problem(same as ignite). Not sure whether reads in
cassandra in-memory table can be scaled by increasing the number of
slaves.
We are open to other solutions but wondering whether the multi-query will be a problem everywhere(like hazelcast).
The ideal solution for our use case would be an in-memory database with indexes which could be read scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries and we are reluctant because one erring node could slow the whole system down.

Hazelcast supports indexes (sorted & unsorted) and what is important there is no Multi-Query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to a node that is a primaryReplica of the key passed to the constructor of the PartitionPredicate. So if you know where the data resides you can just query this node. So no need to fix or implement anything to support it, you can use it right away.
It's probably not reasonable to use it all the time. Depends on your use-case.
For complex queries that scan a lot of data but return small results it's better to use OBJECT inMemoryFormat. You should get excellent execution times and low latencies.

Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this as disadvantage. This is reality you should embrace and accept. It is necessary understand why it is slow and fix/upgrade it.
2) Ignite are able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently query is broadcasted to all data nodes. We are going to improve it. First, we will let users to specify partitions to execute the query against [2]. Second, we are going to improve our optimizer so that it will try to calculate target data nodes in advance to avoid broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least - persistent layer will be released in several months [5], meaning that Ignite will become distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html

I can give opinions on cassandra. Max size of your table per node is configurable and tunable so it depends on the amount of the memory that you are willing to pay. Partitioning is built in into cassandra so basically cassandra manages it for you. It's relatively simple to do paritioning. Basically first part of the primary key syntax is partitioning key and it determines on which node in the cluster the data lives.
But I also guess you are aware of this since you are mentioning multiple query per node. I guess there is no nice way around it.
Just one slight remark there is no master slaves in cassandra. Every node is equal. Basically client asks any node in the cluster, this node then becomes coordinator nodes and since it gets partitioning key it knows which node to ask the data for and it gives it then to the client.
Other than that I guess you read upon cassandra enough (from what I can see in your question)
Basically it comes down to the access pattern, if you know how you are going to access your data then it's the way to go. But other databases are also pretty decent.
Indexing with cassandra usually hides some potential performance problems. Usually people avoid it because in cassandra index has to be build for every record there is on whole cluster and it's done per node. This doesn't really scale. Basically you always have to do query first no matter how ypu put it with cassandra.
Plus the in memory seems to be part of the DSE cassandra. Not the open source or community one. You have to take this into account also.

Azure Table Storage transaction limitations

I'm running performance tests against ATS and its behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline is non blocking (await/async) and using TPL for concurrent and parallel execution.
First of all its very strange that with this setup i'm only getting about 1200 insertions. This is running on a L VM box, that is 4 cores + 800mbps.
I'm inserting 100.000 rows with unique PK and unique RK, that should leverage the ultimate distribution.
Even more deterministic behavior is the following.
When I run 1 VM i get about 1200 insertions per second.
When I run 3 VM i get about 730 on each insertions per second.
Its quite humors to read the blog post where they are specifying their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What shall I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1,2k per VM?
--
Update:
I've now also tried using 3 storage accounts for each individual node and is still getting the performance / throttling behavior. Which i can't find a logical reason for.
--
Update 2:
I've optimized the code further and now i'm possible to execute about 1550.
--
Update 3:
I've now also tried in US West. The performance is worse there. About 33% lower.
--
Update 4:
I tried executing the code from a XL machine. Which is 8 cores instead of 4 and the double amount of memory and bandwidth and got a 2% increase in performance so clearly this problem is not on my side..

A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partiton servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I acheived, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or if it disables Nagles Algorithm as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.Azure.StorageClient.dll) or 2.0 (Microsoft.Azure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.

I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;

Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.

I would tend to believe that the maximum throughput is for an optimized load. For example, I bet you that you can achieve higher performance using Batch requests than individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't Batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (maximum per batch), still using GUIDs, but for which 100 entities would have the same PK?

HIVE/HDFS for realtime storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be access periodically (more frequent). For e.g I need sensor data collected over the last hour/day/week/month for analysis/reporting.
Have started looking at Hive/HDFS as an option. Can someone comments on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would work, it seems more suited to data warehousing applications than real time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?

I think HBase can be a good option for you. In fact there's already an open/source implementation in HBase which solves similar problem that you might want to use. Take a look at openTSB which is an open source implementation for solving similar problems. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.

There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would lookup all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.

Practical Limits of ElasticSearch + Cassandra

I am planning on using ElasticSearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of ElasticSearch. Do things get slow in the petabyte range? Also, has anyone has any problems using ElasticSearch to index Cassandra?

See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards each of 200GB, which would be in the 1/3 petabyte range. I would expect that the architecture of ElasticSearch would support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing etc.

Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limitation to how far horizontally you can scale ES, but as DNA mentioned there are practical problems. The biggest by far is network. It applies to every distributed data storage. You can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and exhorbitant cost per byte.

AS DNA mentioned, 1700 shards, but it is not 1700 shards but there are 1700 indexes each with 1 shard and 1 replica. So it is quite possible that these 1700 indexes are not present on single machine but are split around multiple machines.
So this is never a problem

I am currently starting working with Elisandra (Elasticsearch + Cassandra)
I am also, having problems to index Cassandra with elasticsearch. My problem is basically the node configuration.
Doing $ nodetool status you can see Host ID and then ruining:
curl -XGET http://localhost:9200/_cluster/state/?pretty=true
You can check that one of the node: is the same name as Host ID

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string