What is the fastest free DB available between redis, mongodb and mysql (or other if justified) to use with nodejs for storing and querying audio data?
The audio will be stored in the wav format, and should not exceed 1 MB in size.
I expect the number of concurrent requests to be around 50 per second.
My constraints are free DBs, and speed.
Is there any comparison between the different DBs on this?
I need your help to verify if RethinkDB fits my use case.
Use case
My team is building a generic Real-time aggregation platform which needs to:
join data from a lot of Kafka topics
Joins need to be done on raw data
Topics have the same key
Data in topics is sometimes a “snapshot” (updatable) and sometimes an “event” (non-updatable)
The destination of the joined data will be some analytical OLAP DB. Clickhouse, Druid, etc. Depending on the case. These systems work with “deltas” (SCDs). Because of “snapshots”, I need stateful processing.
Updates for snapshots can come up to 7 days later
Topics receive around 20k msg/s with peaks up to 200k msg/s
Data in topics is json from 100 Bytes to 5kB
Data in topics can have duplicates
Duplicates are removed using the “version” JSON field, which is present in every topic. Data should be processed only if new_version > old_version, or if there is no old_version.
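To make the version rule concrete, here is a minimal sketch. It assumes the version is tracked per topic and key, and it uses an in-memory Map as a stand-in for the real state store; the names lastVersion and shouldProcess and the sample topics are hypothetical.

```javascript
// Hypothetical in-memory stand-in for the state store:
// "<topic>/<key>" -> last processed version.
const lastVersion = new Map();

// Returns true when the message should be processed: either no version has
// been seen for this topic/key yet, or the new version is strictly greater.
function shouldProcess(topic, key, message) {
  const stateKey = `${topic}/${key}`;
  const oldVersion = lastVersion.get(stateKey);
  if (oldVersion === undefined || message.version > oldVersion) {
    lastVersion.set(stateKey, message.version);
    return true;
  }
  return false; // duplicate or stale version
}

const incoming = [
  { topic: "orders", key: "a", value: { version: 1 } },
  { topic: "orders", key: "a", value: { version: 1 } }, // duplicate
  { topic: "orders", key: "a", value: { version: 3 } },
  { topic: "orders", key: "a", value: { version: 2 } }, // late, older version
  { topic: "orders", key: "b", value: { version: 7 } },
];

const accepted = incoming.filter((m) => shouldProcess(m.topic, m.key, m.value));
console.log(accepted.map((m) => `${m.topic}/${m.key}@${m.value.version}`));
```

Only the first sighting of version 1, the jump to version 3, and the first message for key "b" pass the check.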
I already have a POC with Cassandra with five stages:
Cassandra Inserter - consumes from all Kafka topics. It does inserts only, for all topics, into the same Cassandra table. Sharding is done on the column that holds the key shared by all the Kafka topics, so all messages with the same key end up in the same shard.
For every Cassandra insert an InsertEvent is produced to Kafka
Delta calculator - consumes InsertEvents and queries Cassandra by the sharding key. It gets all the raw data, then deduplicates it and creates deltas. The state is saved in another Cassandra cluster by storing all the processed “versions”. The next time a new InsertEvent comes in, we use the saved state “version” to fetch only two events, previous and current, so that we can create a DeltaEvent
DeltaEvent is produced to Kafka
ClickHouse / Druid ingest the data
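For illustration only, the "previous and current" step of the delta calculator could look like the sketch below. The shape of a DeltaEvent is not described above, so the fields used here (fromVersion, toVersion, changed) and the function name createDelta are assumptions, as is the choice to compare top-level fields only.

```javascript
// Builds a hypothetical DeltaEvent from the previous and current snapshot of
// one key. Only top-level fields are compared; removed fields are not
// reported in this sketch.
function createDelta(key, previous, current) {
  const changed = {};
  for (const [field, value] of Object.entries(current)) {
    if (field === "version") continue; // the version is reported separately
    const before = previous === undefined ? undefined : previous[field];
    if (JSON.stringify(before) !== JSON.stringify(value)) {
      changed[field] = value;
    }
  }
  return {
    key,
    fromVersion: previous === undefined ? null : previous.version,
    toVersion: current.version,
    changed,
  };
}

const delta = createDelta(
  "a",
  { version: 1, price: 10, qty: 2 },
  { version: 2, price: 12, qty: 2 }
);
console.log(delta);
```

When there is no previous snapshot, every field of the current one ends up in changed and fromVersion is null.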
So it's basically a 50/50 insert/read workload without updates to Cassandra.
With 14 Cassandra data nodes and 8 state nodes it works OK up to 20k InsertEvent/s. At 25k InsertEvent/s the system begins to lag.
Nodes have 16 GB RAM and the disks are network storage backed by SSD (not ideal, I know, but I can't change it now). The network is 10 Gbit.
RethinkDB idea
I would like to do a new POC to try RethinkDB and use changefeeds to create deltas and to deduplicate. For this I would use a single table. Primary key / sharding key would be the Kafka key and all Kafka data from all topics with the same key would be joined/upserted in a single document.
The workload would probably be 10/90 insert/update. I would use squash: true to avoid excessive reads and reduce the number of DeltaEvents.
Do you think this is a good use case for RethinkDB?
Will it scale up to 200k msg/s, which would be 20k inserts/s, 180k updates/s and around 150k reads/s via changefeeds?
I will need to delete data older than 7 days. How will that affect the insert/update/query workload?
Do you have a proposal for a system that would be a better fit for this use case?
Thanks a lot,
Davor
PS: if you prefer reading a document, here it is: RethinkDB use case question.
IMHO, RethinkDB is a good fit for your use case.
From RethinkDB docs
...RethinkDB scales to perform 1.3 million individual reads per second. ...RethinkDB performs well above 100 thousand operations per second in a mixed 50:50 read/write workload - while at the full level of durability and data integrity guarantees. ...performed all benchmarks across a range of cluster sizes, scaling up from one to 16 nodes.
The folks at RethinkDB tested a similar scenario using workloads from the YCSB benchmark suite and reported their results.
We found that in a mixed read/write workload, RethinkDB with two servers was able to perform nearly 16K queries per second (QPS) and scaled to almost 120K QPS while in a 16-node cluster. Under a read only workload and synchronous read settings, RethinkDB was able to scale from about 150K QPS on a single node up to over 550K QPS on 16 nodes. Under the same workload, in an asynchronous “outdated read” setting, RethinkDB went from 150K QPS on one server to 1.3M in a 16-node cluster.
Selecting workloads and hardware
...Out of the YCSB workload options, we chose to run workload A which comprises 50% reads and 50% update operations, and workload C which performs strictly read operations. All documents stored by the YCSB tests contain 10 fields with randomized 100 byte strings as values, with each document totaling about 1 KB in size.
Given the ease of scaling RethinkDB clusters across multiple instances, we deemed it necessary to observe performance when moving from a single RethinkDB instance to a larger cluster. We tested all of our workloads on a single instance of RethinkDB up to a 16-node cluster in varying increments of cluster size.
Additionally, I suggest reading through the limitations of RethinkDB. I've copied some here.
There is a hard limit of 64 shards.
While there is no hard limit on the size of a single document, there is a recommended limit of 16MB for memory performance reasons.
The maximum size of a JSON query is 64M.
Primary keys are limited to 127 characters.
Secondary indexes do not store objects or null values.
Primary key strings may not include the null codepoint (U+0000).
By default, arrays on the RethinkDB server have a size limit of 100,000 elements. This can be changed on a per-query basis with the arrayLimit (or array_limit) option to run.
RethinkDB does not support Unicode collations, and does not normalize for identical characters with multiple codepoints (i.e, \u0065\u0301 and \u00e9 both represent the character “é” but RethinkDB treats them, and sorts them as, distinct characters).
Since yours is a real-time system, the RethinkDB memory requirements and crash recovery docs are also worth a read.
Note also that the benchmarks above do not cover delete performance.
I am evaluating whether we need to partition a table in ASE.
We would need to do typical DB operations like CRUD, but no complex queries.
Do you know how large a table ASE can normally handle with decent performance?
That is, how many rows, and how large a total size?
Thanks,
The table size is only restricted by database size (docs).
The maximum database size is 64TB for a server with a 16K page size (docs).
And what about decent performance? It depends on the database schema and available RAM (amount of memory for cache) and what you define as decent performance.
I have a NodeJS application that needs to stream data from an RDS Postgres, perform some relatively expensive CPU operations on the data, and insert it into another database. I've offloaded the CPU-intensive portion to an AWS Lambda, so the Node application gets a batch of rows and immediately passes them to the Lambda for processing. The bottleneck appears to be the speed at which the data can be received from Postgres.
In order to utilize multiple connections to the DB, I have an algorithm which is effectively leapfrogging on sorted IDs, so that many concurrent connections can be maintained. Ex: 1 connection fetches ids 1-100, second one fetches ids 101-200, etc, and then when the first returns maybe it fetches ids 1001-1100. Is this relatively standard practice? Is there a faster method for pulling the data out for processing?
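For reference, the leapfrogging scheme described above can be sketched roughly as follows. The names planRanges, runLeapfrog and fakeFetchRange are hypothetical, and fakeFetchRange only stands in for a real query (something along the lines of selecting rows whose id lies between two bounds on a pooled connection).

```javascript
// Splits [minId, maxId] into consecutive, inclusive ID ranges of batchSize.
function planRanges(minId, maxId, batchSize) {
  const ranges = [];
  for (let from = minId; from <= maxId; from += batchSize) {
    ranges.push({ from, to: Math.min(from + batchSize - 1, maxId) });
  }
  return ranges;
}

// Starts `concurrency` workers. Each worker claims the next unclaimed range
// as soon as its previous batch has been fetched and handled.
async function runLeapfrog(ranges, concurrency, fetchRange, handleRows) {
  let next = 0;
  async function worker() {
    while (next < ranges.length) {
      // Claimed synchronously (before any await), so no range is taken twice.
      const { from, to } = ranges[next++];
      const rows = await fetchRange(from, to);
      await handleRows(rows);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
}

// Stand-in for the database call: returns one fake row per ID in the range.
async function fakeFetchRange(from, to) {
  const rows = [];
  for (let id = from; id <= to; id++) rows.push({ id });
  return rows;
}

let processed = 0;
const done = runLeapfrog(planRanges(1, 1050, 100), 4, fakeFetchRange, async (rows) => {
  processed += rows.length; // in the real application: hand the batch to the Lambda
}).then(() => {
  console.log(`processed ${processed} rows`);
  return processed;
});
```

With 4 workers and batches of 100, the first four ranges start immediately and whichever worker finishes first picks up ids 401-500, and so on.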
So long as I am below the database's max_connections, would it be arguably beneficial to add more, possibly as additional concurrent applications streaming data out of it? Both the application and the RDS are currently in the VPC, and the CPU utilization on the RDS gets to about 30%, with memory at 60%.
It would likely be MUCH faster to dump your Postgres database into a CSV file or export it directly to flat files, dump the flat files into S3 after splitting them up, then have workers process each batch of files on their own.
Streaming data out of Postgres (particularly if you're doing it for millions of items) will take a LOT of IO and a very long time.
I am using this memcached package with nodejs. Since the default maximum size of data per key is 1 MB, I run into a problem when the data for a particular key is larger than 1 MB.
One workaround would be to raise the maximum item size above 1 MB in memcache.conf using
-I 2M
and in code setting the maxValue
var Memcached = require('memcached');
var memcached = new Memcached('localhost:11211', {maxValue: 2097152});
What would be the proper way to stay within the 1 MB limit? I have read suggestions about splitting the data into multiple keys. How can I split JSON data across multiple keys with the memcached package?
Options available :
1/ Make sure you are compressing the values before storing them in memcached; your Node.js memcached driver may support gzip compression.
2/ Split the data into multiple keys.
3/ Increase the max object size to more than 1 MB (but that may increase fragmentation and decrease performance, depending on your cache usage).
4/ Use Redis as the cache instead of memcached if your objects are usually large. The Redis string datatype supports values up to 512 MB in size, available through the plain get/set interface of any standard Node.js Redis driver.
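As a rough sketch of option 2, the value can be serialised once, cut into byte chunks, and stored under derived keys plus a small manifest under the original key. Everything here is an assumption made for illustration: the key layout ("<key>:<index>"), the function names toChunkEntries and fromChunks, and the Map that stands in for the cache. Wiring it to the memcached client's own set and get calls is left out.

```javascript
// Serialises `value` to JSON and cuts it into chunks of at most `chunkBytes`
// bytes. Returns [key, value] pairs: one per chunk, plus a manifest stored
// under the original key that records how many chunks there are.
function toChunkEntries(key, value, chunkBytes) {
  const bytes = Buffer.from(JSON.stringify(value), "utf8");
  const entries = [];
  for (let offset = 0; offset < bytes.length; offset += chunkBytes) {
    entries.push([`${key}:${entries.length}`, bytes.subarray(offset, offset + chunkBytes)]);
  }
  const manifest = JSON.stringify({ chunks: entries.length });
  entries.push([key, manifest]);
  return entries;
}

// Reassembles the value using `get(key)`. Returns undefined if the manifest
// or any chunk is missing, which should be treated as a cache miss.
function fromChunks(key, get) {
  const manifest = get(key);
  if (manifest === undefined) return undefined;
  const { chunks } = JSON.parse(manifest);
  const parts = [];
  for (let i = 0; i < chunks; i++) {
    const part = get(`${key}:${i}`);
    if (part === undefined) return undefined; // a chunk was evicted
    parts.push(part);
  }
  // Concatenate bytes first, then decode, so multi-byte characters that were
  // split across two chunks are restored correctly.
  return JSON.parse(Buffer.concat(parts).toString("utf8"));
}

// Demo with a Map standing in for the cache and a tiny chunk size.
const store = new Map();
const original = { name: "café", samples: Array.from({ length: 50 }, (_, i) => i) };
for (const [k, v] of toChunkEntries("clip:42", original, 64)) store.set(k, v);
const restored = fromChunks("clip:42", (k) => store.get(k));
console.log(store.size - 1, "chunks, round trip ok:", JSON.stringify(restored) === JSON.stringify(original));
```

In real use the chunk size should sit somewhat below 1 MB to leave headroom for the key and per-item overhead, and since any chunk can be evicted independently, a missing chunk has to be handled as a miss for the whole value.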
I am curious whether VoltDB compresses the data on disk/at rest.
If it does, what is the algorithm used and are there options for 3rd party compression methods (e.g. a loss permitted proprietary video stream compression algorithm)?
VoltDB uses Snappy compression when writing Snapshots to disk. Snappy is an algorithm optimized for speed, but it still has pretty good compression. There aren't any options for configuring or customizing a different compression method.
Data stored in VoltDB (e.g. when you insert records) is stored 100% in RAM and is not compressed. There is a sizing worksheet built in to the web interface that can help estimate the RAM required based on the specific datatypes of the tables, and whatever indexes you may define.
One of the datatypes that is supported is VARBINARY, which stores byte arrays, i.e. any binary data. You could store pre-compressed data in VARBINARY columns, or use a third-party Java compression library within stored procedures to compress and decompress inputs. There is a maximum size limit of 1MB per column and 2MB per record; however, a procedure could store larger binary data by splitting it across multiple records. There is a maximum size of 50MB for the inputs to or the results from a stored procedure. You could potentially store and retrieve larger binary data using multiple transactions.
I saw you posted the same question in our forum. If you'd like to discuss this further, that is the best place. We'd also like to talk to you about your solution, so if you like, I can contact you at the email address from your VoltDB Forum account.