What is MongoDB's storage size limit on 64-bit platforms? Can MongoDB store 500-900 GB of data within one instance (node)? What was the largest amount of data you've stored in MongoDB, and what was your experience?
The "production deployments" page on MongoDB's site may be of interest to you. Lots of presentations listed with infrastructure information. For example:
http://blog.wordnik.com/12-months-with-mongodb says they're storing 3 TB per node.
MongoDB's storage limits on different operating systems are tabulated below, as per the MongoDB 3.0 MMAPv1 storage engine limits.
The MMAPv1 storage engine limits each database to no more than 16000 data files. This means that a single MMAPv1 database has a maximum size of 32TB. Setting the storage.mmapv1.smallFiles option reduces this limit to 8TB.
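For reference, here is a minimal sketch of where that option lives in a YAML-style mongod.conf (the path and values are assumptions for illustration):

    # mongod.conf (MongoDB 3.0, YAML format) - illustrative excerpt only
    storage:
      dbPath: /data/db          # assumed data directory
      engine: mmapv1
      mmapv1:
        smallFiles: true        # lowers the per-database cap from 32TB to 8TB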
Using the MMAPv1 storage engine, a single mongod instance cannot manage a data set that exceeds the maximum virtual memory address space provided by the underlying operating system.
Virtual Memory Limitations
Operating System                          Journaled       Not Journaled
Linux                                     64 terabytes    128 terabytes
Windows Server 2012 R2 and Windows 8.1    64 terabytes    128 terabytes
Windows (otherwise)                       4 terabytes     8 terabytes
Reference: MongoDB Database Limits.
Note: The WiredTiger storage engine is not subject to this limitation.
Another way to store more data than a single mongod comfortably manages is to run multiple mongod processes on the node. So sharding is one option, as is manually partitioning data across the processes.
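As a rough sketch (ports and paths here are made up for illustration), each mongod simply gets its own port and dbpath:

    # run two independent mongod processes on one box (assumed paths/ports)
    mongod --port 27017 --dbpath /data/db1 --fork --logpath /var/log/mongod1.log
    mongod --port 27018 --dbpath /data/db2 --fork --logpath /var/log/mongod2.log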
Hope this helps.
You won't come anywhere near the cap with 1 TB on a 64-bit system. However, MongoDB works best when its indexes fit in memory, so a smooth experience depends on your index size and how much memory you have. With a beefy enough system it won't be a problem.
The Context
I'm currently running tests with Apache Cassandra on a single-node cluster. I've verified the cluster is up and running using nodetool status, I've done a multitude of reads and writes that suggest as much, and I'm confident my cluster is set up properly. I am now attempting to increase my throughput by mounting an SSD onto the directory where Cassandra writes its data.
My Solution
Cassandra generally writes its data to /var/lib/cassandra/data; however, I've switched mine via cassandra.yaml to write to another location, where I've mounted my SSD. I've verified that Cassandra is writing to this location by checking the size of the data directory's contents with watch du -h and other methods. The directory I've mounted the SSD on includes table data, the commitlog, hints, a nested data directory, and saved_caches.
The Problem
I've been using YCSB benchmarks (see https://github.com/brianfrankcooper/YCSB) to test Cassandra's average throughput and ops/sec. I've noticed no difference in average throughput between mounting the HDD and the SSD at the location Cassandra writes its data to. I've analyzed disk access through dstat -cd --disk-util --disk-tps and found the HDD's utilization caps out in multiple instances, whereas the SSD only spikes to around 80% on several occasions.
The Question
How can I improve Cassandra's throughput using an SSD over an HDD? I assume this is the correct place to mount my SSD, but is Cassandra not making use of the extra speed? Any help would be greatly appreciated!
An SSD should always beat an HDD in terms of latency, etc. It's just a law of physics. I think your test simply didn't put enough load on the system. Another problem could be that you mounted only the data directory on the SSD but not the commit logs: on HDDs the commit logs should always be put on a separate disk to avoid contention with the data load, while on SSDs they can share a disk with the data. Please point all of the directories at the SSD to see a difference, as in the cassandra.yaml sketch below.
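A minimal cassandra.yaml sketch, assuming the SSD is mounted at /mnt/ssd (the mount point is an assumption; the keys are Cassandra's standard directory settings):

    # cassandra.yaml - point every write-heavy directory at the SSD mount
    data_file_directories:
        - /mnt/ssd/data
    commitlog_directory: /mnt/ssd/commitlog
    hints_directory: /mnt/ssd/hints
    saved_caches_directory: /mnt/ssd/saved_caches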
I recommend performing a comparison using the following tools:
perfscripts - uses the fio tool to emulate Cassandra-like workloads; if you run it on both the HDD and the SSD, you will see the difference in latency (see the fio sketch after this list). You may not even need to run it - just look in the historic folder, which holds results for different disk types;
DSBench - recently released by the DataStax team, which specializes in benchmarking Cassandra and DSE. There are built-in workloads described in the wiki that you can use for testing. Just make sure you run the load long enough to see the effect of compaction, etc.
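If you want a quick ad-hoc check without either tool, a bare fio run along these lines gives a rough Cassandra-like mixed random workload to compare the two disks (the sizes, runtime, and target directory are assumptions):

    # compare latency: run once with --directory on the HDD, once on the SSD
    fio --name=cassandra-like --directory=/mnt/ssd \
        --ioengine=libaio --direct=1 --rw=randrw --rwmixread=70 \
        --bs=4k --size=2G --numjobs=4 --time_based --runtime=120 \
        --group_reporting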
1. I have been given the task of setting up hardware for a Cassandra DB (preferably on a VM). For now, Cassandra has 100 GB of data and data ingestion is at 500 bytes every 2 seconds. What kind of hardware/VM should I use?
2. We need Power BI Report Server to connect to this DB; I plan to use the CData ODBC Driver to establish the connection. Given the above configuration, will I face any issues with performance or the connection?
Thanks,
Karthik
To your first part:
Your incoming data rate is 250 bytes/s (500 bytes every 2 seconds). Over a single year, this is about 8 GB raw, which is quite small and should easily fit into a virtual machine. Keep in mind that the storage used on disk will be higher than this, as there is overhead for internal structures as well as for replication (if you need high availability).
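The back-of-the-envelope arithmetic, for the record:

    # 250 B/s, sustained for one year
    echo $(( 250 * 60 * 60 * 24 * 365 ))   # 7884000000 bytes, i.e. ~7.9 GB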
But I don't recommend VMs for Cassandra, as they often use shared storage for their images, which can be a real performance killer due to noisy neighbours and latency. This issue is less of a concern when SSD or NVMe storage is used.
For the second part: I don't know much about Power BI beyond its name, but there is (or was) an ODBC driver for Cassandra from DataStax:
https://www.datastax.com/dev/blog/using-the-datastax-odbc-driver-for-apache-cassandra
Maybe that helps.
We are facing an OOM error when trying to execute multiple SQL query sessions via a scheduled job.
Detailed error:
The error message is: org.postgresql.util.PSQLException:ERROR: Out of memory (seg6 slice5 sungpmsh0:40002 pid=13610)
Detail: VM protect failed to allocate 65584 bytes from system, VM Protect 5835 MB available
We tried
After reading the Pivotal support doc, we did some basic troubleshooting and validated two memory parameters.
Current settings in GPDB:
gp_vmem_protect_limit: 8 GB
statement_mem: set based on the vmprotect limit; from what I've read, it governs the memory available to a query on each segment.
Test 2: we tuned the SQL queries. Also, what should I tune here? Please guide.
Based on these sources:
https://discuss.pivotal.io/hc/en-us/articles/201947018-Pivotal-Greenplum-GPDB-Memory-Configuration
https://discuss.pivotal.io/hc/en-us/articles/204268778-What-are-VM-Protect-failed-to-allocate-d-bytes-d-MB-available-error-
But we are still getting the same OOM error.
Do we need to increase the vmprotect limit? If yes, by how much should we increase it?
How do we handle concurrency in GPDB?
How much swap do we need to add when we are already running with 30 GB of RAM? Currently we have added 15 GB of swap; is that OK?
What is the query to identify host connections to the Greenplum database?
Thanks in advance
Do we need to increase the vmprotect limit? If yes, by how much should we increase it?
There is a nice calculator on setting gp_vmem_protect_limit on Greenplum.org. The setting depends on how much memory, swap, and segments per host you have.
http://greenplum.org/calc/
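If you'd rather compute it by hand, a sketch following the formula the Greenplum docs publish (valid for hosts with up to roughly 256 GB of RAM; the segment count below is an assumed example, the 30/15 GB figures are this question's):

    # rough gp_vmem_protect_limit estimate, per the documented formula
    RAM=30; SWAP=15; SEGMENTS=6
    # gp_vmem = ((SWAP + RAM) - (7.5 + 0.05 * RAM)) / 1.7   (in GB)
    GP_VMEM=$(echo "(($SWAP + $RAM) - (7.5 + 0.05 * $RAM)) / 1.7" | bc -l)
    echo "gp_vmem_protect_limit ~ $(echo "$GP_VMEM * 1024 / $SEGMENTS" | bc) MB per segment"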
You can be getting OOM errors for several reasons.
Bad query
Bad table distribution (skew)
Bad settings (like gp_vmem_protect_limit)
Not enough resources (RAM)
How do we handle concurrency in GPDB?
More RAM, fewer segments per host, and workload management to limit the number of concurrently running queries.
How much swap do we need to add when we are already running with 30 GB of RAM? Currently we have added 15 GB of swap; is that OK?
Only 30GB of RAM? That is pretty small. You can add more swap but it will slow down the queries compared to real RAM. I wouldn't use much more than 8GB of swap.
I recommend using 256GB of RAM or more especially if you are worried about concurrency.
What is the query to identify host connections to the Greenplum database?
select * from pg_stat_activity;
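If you want it summarized per client host rather than row by row, something along these lines groups the standard pg_stat_activity view by client address (run through psql here, though any client works):

    psql -c "SELECT client_addr, count(*) FROM pg_stat_activity GROUP BY client_addr ORDER BY count(*) DESC;"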
I am quite a novice on Azure and I am a bit stuck trying to understand virtual machine sizes and features.
I have just deployed the "Hortonworks Sandbox with HDP 2.4" virtual machine template on a DS3_v2 machine, which seems to have the following features: 4 cores, 14 GB RAM, 8 data disks, and a 28 GB SSD. That is pretty decent for running a proof of concept, but I have some doubts. I am not sure about the total disk size available on this machine: is it 200 GB or 100 GB, and does that size include the OS VHD? I understand I can attach up to 8 data disks from my storage account, summing to either 100 GB or 200 GB.
The DS3_v2 machine also includes Azure Premium Storage, which I think refers to the 28 GB SSD. I guess I could have two SSD data disks of 14 GB each?
I really appreciate any insight about these doubts.
Thank you very much.
Your OS disk is a separate one. The 28 GB SSD is a local, temporary disk (think of it as D:\ while your OS is on C:), and the data on it is not guaranteed to survive hardware failures. The data on the 8 disks you can attach is persisted, you can choose the redundancy (e.g. GRS or RA-GRS geo-redundant storage), and each of these disks can be up to 1 TB (around 1023 GB), which means you can attach a total of around 8 TB of data disks to a DS3_v2 instance.
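For what it's worth, creating and attaching such a data disk with the current Azure CLI looks roughly like this (the resource group, VM, and disk names are placeholders):

    # create and attach a new 1023 GB managed data disk (names are made up)
    az vm disk attach --resource-group MyResourceGroup --vm-name MyVM \
        --name datadisk1 --new --size-gb 1023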
I suspect the answer is "it depends", but is there any general guidance about what kind of hardware to plan to use for Presto?
Since Presto uses a coordinator and a set of workers, and workers run with the data, I imagine the main issues will be having sufficient RAM for the coordinator, sufficient network bandwidth for partial results sent from workers to the coordinator, etc.
If you can supply some general thoughts on how to size for this appropriately, I'd love to hear them.
Most people are running Trino (formerly PrestoSQL) on the Hadoop nodes they already have. At Facebook we typically run Presto on a few nodes within the Hadoop cluster to spread out the network load.
Generally, I'd go with the industry-standard ratios for a new cluster: 2 cores and 2-4 GB of memory for each disk, with 10 gigabit networking if you can afford it. After you have a few machines (4+), benchmark using your queries on your data. It should be obvious if you need to adjust the ratios.
In terms of sizing the hardware for a cluster from scratch some things to consider:
Total data size will determine the number of disks you will need. HDFS has a large overhead so you will need lots of disks.
The ratio of CPU speed to disks depends on the ratio between hot data (the data you are working with) and cold data (archive data). If you are just starting your data warehouse, you will need lots of CPUs, since all the data will be new and hot. On the other hand, most physical disks can only deliver data so fast, so at some point more CPUs don't help.
The ratio of CPU speed to memory depends on the size of aggregations and joins you want to perform and the amount of (hot) data you want to cache. Currently, Presto requires the final aggregation results and the hash table for a join to fit in memory on a single machine (we're actively working on removing these restrictions). If you have larger amounts of memory, the OS will cache disk pages which will significantly improve the performance of queries.
In 2013 at Facebook we ran our Presto processes as follows:
We ran our JVMs with a 16 GB heap to leave most memory available for OS buffers
On the machines where we ran Presto, we didn't run MapReduce tasks.
Most of the Presto machines had 16 real cores and used processor affinity (eventually cgroups) to limit Presto to 12 cores (so the Hadoop data node process and other things could run easily).
Most of the servers were on 10 gigabit networks, but we did have one large old crufty cluster using 1 gigabit (which worked fine).
We used the same configuration for the coordinator and the workers.
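For context, the heap setting from the first bullet lives in Presto's etc/jvm.config. A sketch along the lines of the deployment docs (the GC and OOM flags are typical examples, not necessarily what we ran):

    # etc/jvm.config - illustrative excerpt
    -server
    -Xmx16G
    -XX:+UseG1GC
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+ExitOnOutOfMemoryError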
In recent times, we ran the following:
The machines had 256 GB of memory and we ran a 200 GB Java heap
Most of the machines had 24-32 real cores and Presto was allocated all cores.
The machines had only minimal local storage for logs, with all table data remote (in a proprietary distributed file system).
Most servers had a 25 gigabit network connection to a fabric network.
The coordinators and workers had similar configurations.