Splitting tablets in YugabyteDB over multiple directories using fs_data_dirs - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
I did some reading in the docs and found fs_data_dirs. Does yugabyte-db automatically split the tablets evenly in the data dirs?

The flag fs_data_dirs sets the directory or directories where the tablet server or master stores its data on the filesystem. It should be specified as a comma-separated list.
That data consists of logging, metadata, and table data. The first directory gets the logging; all of the directories get the WAL and RocksDB databases. The tablets, which are the storage foundation of a table or index, are distributed over the directories in a round-robin fashion. This indeed happens completely automatically.
It might be confusing to talk about splitting, because when a YSQL table or secondary index is created, the create statement allows you to explicitly define how many tablets the object is split into; those tablets are what gets distributed over the specified directories.
At the risk of adding to the confusion, there is another feature called automatic tablet splitting, enabled by the --enable_automatic_tablet_splitting flag on the masters. It makes YugabyteDB split tablets automatically when it deems them too big, and thus allows you to start with a single tablet, which will then be split automatically as it grows.
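As an illustration of the explicit variant, here is a minimal sketch that pre-splits a YSQL table into a fixed number of tablets. It assumes a local cluster with default YSQL settings; the table name and tablet count are hypothetical. YSQL speaks the PostgreSQL wire protocol, so psycopg2 works as a client.

import psycopg2

# Assumed local cluster with default YSQL host, port, user, and database.
conn = psycopg2.connect(host="127.0.0.1", port=5433,
                        user="yugabyte", dbname="yugabyte")
conn.autocommit = True
with conn.cursor() as cur:
    # Hypothetical table, explicitly pre-split into 8 tablets; those tablets
    # are then placed round-robin across the directories in --fs_data_dirs.
    cur.execute("""
        CREATE TABLE orders (
            id   bigint PRIMARY KEY,
            note text
        ) SPLIT INTO 8 TABLETS
    """)
conn.close()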

Related

Custom controlled partitioning

I recently posted a question and received a full answer, but I am encountering another problem.
Case scenario is the same as in my recent question.
How can I configure a member to own a partition key?
e.g. DataCenterOnRussia partition key must always be owned by member1 and DataCenterOnGermany partition key must always be owned by member2.
So member2 could request data from DataCenterOnRussia using PartitionAwareKey.
The intent of the PartitionAwareKey is to allow for data affinity ... orders for a customer should be stored in the same partition as the customer record, for example, since they are frequently accessed together.
The PartitionAwareKey allows grouping items together, but not a way to specify the placement of those items on a specific cluster member. (I guess if there were such a thing, it would likely be called MemberAwareKey).
A cluster in Hazelcast isn't a fixed-size entity; it is dynamically scalable, so members might be added or removed, and it is fault-tolerant, so a member could be lost without loss of the data that happened to be on that member. In order to support those features, the cluster must have the freedom to move partitions around to different machines as the cluster topology changes.
Hazelcast recommends that all members of a cluster be similarly configured (equivalent memory configuration, most particularly) because of the idea that cluster members are interchangeable, at least as far as data storage goes. (The MemberSelector facility does provide a way to handle systems that have different processing capability, e.g., number of processor cores; but nothing similar exists to allow placement of specific data entries or partitions on a designated member.)
If your use case requires specific placement on machines, it's an indication that those machines probably should not be part of the same cluster.

Couchdb database design options

Is it recommended to have a separate database for each document type in couchdb or place all types of documents in a single database?
Is there any limitation on the number of databases that we can create on couchdb?
Are there any drawbacks in creating large number of databases in couchdb?
There is no firm answer. Here are some guidelines:
If two documents must be visible to different sets of users, they must be in different DBs (read/write privs are per-DB, not per-doc).
If two documents must be included in the same view, they must be in the same DB (views are for a single DB only; see the sketch after this list).
If two types of documents will be numerous and never be included in the same view, they might as well be in different DBs (so that accessing a view over one type won't need to process all of the docs of the other type).
It's cheap to drop a database, but expensive to delete all of the documents out of a database. Keep this in mind when designing your data expiration plan.
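To make those guidelines concrete, here is a minimal sketch against a local CouchDB (the URL, credentials, database names, and view are all hypothetical): a view only ever indexes documents in its own database, and dropping a whole database is a single cheap request.

import requests

BASE = "http://admin:secret@localhost:5984"   # hypothetical credentials

# Two databases for two document types that never share a view.
requests.put(f"{BASE}/invoices")
requests.put(f"{BASE}/audit_log")

# A view lives inside one database and only sees that database's documents.
design = {
    "views": {
        "by_customer": {
            "map": "function (doc) { emit(doc.customer_id, null); }"
        }
    }
}
requests.put(f"{BASE}/invoices/_design/reports", json=design)
requests.get(f"{BASE}/invoices/_design/reports/_view/by_customer")

# Dropping a database is one request; deleting every document in it is not.
requests.delete(f"{BASE}/audit_log")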
Nothing hardcoded, but you will eventually start running into resource constraints, depending on the hardware you have available.
Depends on what you mean by "large numbers." Thousands are fine; billions probably not (though with the Cloudant changes coming in v2.0.0 I'd guess that the reasonable cap on DB count probably goes up).

Cassandra multiple disk per node setup

Intro
I have a Cassandra 1.2 cluster, and all the nodes have SSDs. Now I want to add more disks to the existing nodes, but I want to be able to choose which tables are stored on which disks.
Problem
For example, node 1 will have 3 SSDs and 1 regular disk drive, and I want all the column families except one (let's call it the "discord" table) to be stored on the SSDs only; the final table, "discord", needs to be stored on the regular disk.
According to the documentation this should be possible; however, the only way of doing it that I can see is:
Setting up Cassandra to use multiple data_files_directories in cassandra.yaml.
Creating the tables.
Creating a link from the data directory on each SSD to the directory on the hard disk where I want to store the column family.
Question
Is this the only way of doing it, or is there a simpler way of configuring a node to work like this?
You can set multiple directories using the data_file_directories property, but the data is distributed over those folders internally by Cassandra. You cannot decide which keyspace or column family goes to which directory.
So the symbolic links are the way to go, in my opinion.
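For reference, the manual workaround from the question boils down to something like the following sketch. The paths are purely hypothetical, and the node must be stopped (with its data flushed) before anything is moved.

import os
import shutil

# Hypothetical layout: an SSD data directory configured in cassandra.yaml,
# plus a regular disk that should hold only the "discord" table.
ssd_table_dir = "/mnt/ssd1/cassandra/data/mykeyspace/discord"
hdd_table_dir = "/mnt/hdd0/cassandra/data/mykeyspace/discord"

# With the node stopped: move the existing SSTables to the regular disk,
# then leave a symlink behind so Cassandra keeps using the original path.
os.makedirs(os.path.dirname(hdd_table_dir), exist_ok=True)
shutil.move(ssd_table_dir, hdd_table_dir)
os.symlink(hdd_table_dir, ssd_table_dir)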

How does cassandra split keyspace data when multiple directories are configured?

I have configured separate data directories in the cassandra.yaml file as given below:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
When I create a keyspace and insert data, the keyspace gets created in both directories and the data gets scattered across them. What I want to know is: how does Cassandra split the data between multiple directories, and what is the rule behind this?
You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread over the configured drives in proportion to their available space.
This also lets you take advantage of the disk_failure_policy setting. You can read about the details here:
http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2
In short, you can configure Cassandra to keep going, doing what it can, if a disk becomes full or fails completely. This has advantages over RAID0 (where you would effectively have the same capacity as JBOD) in that you do not have to restore the whole data set from backup (or run a full repair), but just run a repair for the missing data. On the other hand, RAID0 provides higher throughput (depending on how well you tune the RAID array to match the filesystem and drive geometry).
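For reference, the relevant cassandra.yaml settings look along these lines (the directory paths are just placeholders); best_effort keeps the node running and simply stops using a directory that is full or failing:

data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data
disk_failure_policy: best_effort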
If you have the resources for a more fault-tolerant/performant RAID setup (RAID10, for example), you may want to just use a single directory for simplicity. Most deployments are starting to lean towards the density route, though, using JBOD rather than system-level tolerance.
You can read about the thought process behind the development of this issue here:
https://issues.apache.org/jira/browse/CASSANDRA-4292
From this I can somewhat guess how the keyspace is split between multiple data directories: based on the available space and load of each directory, SSTables of the same column family get written to different data directories.

Can CouchDB handle thousands of separate databases?

Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with Erlang VM settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
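A minimal sketch of that polling pattern, assuming a local CouchDB, hypothetical branch databases, and a hypothetical aggregating database called rollup (a real job would also persist last_seq per source so it does not re-read everything each run):

import requests

BASE = "http://admin:secret@localhost:5984"   # hypothetical credentials
SOURCE_DBS = ["branch_001", "branch_002"]     # hypothetical branch databases

for db in SOURCE_DBS:
    # Pull the changes feed; since=0 re-reads everything for simplicity.
    feed = requests.get(f"{BASE}/{db}/_changes",
                        params={"include_docs": "true", "since": 0}).json()
    docs = []
    for row in feed["results"]:
        doc = row.get("doc")
        if doc is None or doc["_id"].startswith("_design/"):
            continue
        # Re-key so documents from different branches cannot collide.
        docs.append({"_id": f"{db}:{doc['_id']}", "source": db, "body": doc})
    if docs:
        requests.post(f"{BASE}/rollup/_bulk_docs", json={"docs": docs})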
However, the biggest problem you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers, they're all going to have duplicates of the data. Sure, your max open DBs count will scale linearly with each node added, but other things like view build time won't (e.g., they'll each need to do their own view builds).
That said, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.
I know this question is old, but I wanted to note that more recent versions of CouchDB (3.0+) support partitioned databases, which address this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
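A minimal sketch of that approach, assuming a local CouchDB 3.x and hypothetical names; in a partitioned database, the partition is the part of the document _id before the colon:

import requests

BASE = "http://admin:secret@localhost:5984"   # hypothetical credentials

# Create a single partitioned database for all transactions.
requests.put(f"{BASE}/transactions", params={"partitioned": "true"})

# The partition (here: the bank branch) is encoded in the _id as "<partition>:<key>".
requests.put(f"{BASE}/transactions/branch_042:txn-0001",
             json={"amount": 12.50, "type": "deposit"})

# Query one branch's partition; only the shards holding that partition are touched.
rows = requests.get(f"{BASE}/transactions/_partition/branch_042/_all_docs",
                    params={"include_docs": "true"}).json()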
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that you're only optimizing when a document is updated into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate DBs.
If this latency matters, and assuming couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that updates some documents before any user is waiting.
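One way to read that suggestion is a small loop that requests each view periodically so its index stays fresh before users ask for it; a sketch with hypothetical database and view names:

import time
import requests

BASE = "http://admin:secret@localhost:5984"       # hypothetical credentials
DBS = ["branch_001", "branch_002"]                # hypothetical databases
VIEW = "_design/reports/_view/daily_summary"      # hypothetical view

while True:
    for db in DBS:
        # Requesting the view makes CouchDB fold any new documents into the
        # index, so users don't pay the indexing delay themselves.
        requests.get(f"{BASE}/{db}/{VIEW}", params={"limit": 0})
    time.sleep(60)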
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.
