Is there a way to leverage the BitTorrent DHT for small data?

I have a situation where I have a series of mostly connected nodes that need to sync a pooled dataset. The files are 200-1500 KB and update at irregular intervals of anywhere from 30 minutes to 6 hours, depending on the environment. Right now the number of nodes is in the hundreds, but ideally that will grow.
I am currently using libtorrent to keep a series of files in sync between a cluster of nodes. I do a dump every few hours and create a new torrent based on the prior one, then associate the two using the strategy of BEP 38. The new infohash is posted to a known entry in the DHT, which the other nodes poll to pick it up.
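One common way to implement such a "known entry" is a BEP 44 mutable item, whose DHT target is the SHA-1 of the publisher's ed25519 public key concatenated with an optional salt. A minimal sketch of that derivation (the key bytes and salt here are placeholders, not real values):

```python
import hashlib

def dht_target(public_key: bytes, salt: bytes = b"") -> bytes:
    """BEP 44 mutable-item target: SHA-1 of the 32-byte ed25519
    public key concatenated with the (optional) salt."""
    return hashlib.sha1(public_key + salt).digest()

# A fixed publisher key plus a well-known salt gives every node
# the same 20-byte target to poll for the latest infohash.
target = dht_target(b"\x01" * 32, b"pooled-dataset")
```

Because the target depends only on the key and salt, every node can compute it independently, with no coordination step.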
I am wondering if there is a better way to do this. The reason I originally chose BitTorrent was for firmware updates: I do not need to worry about nodes with less-than-ideal connectivity, and with the DHT the swarm can self-assemble reasonably well. It was then extended to sync these pooled files.
I am currently trying to see if I can make an extension that would have each node do an announce_peer for each new record; interested parties could then, in theory, listen for that. That brings up two big issues:
How do I let the interested nodes know that there is new data?
If I have a thousand or more nodes adding new infohashes every few minutes what will that do to the DHT?
I will admit it feels like I am trying to drive a square peg into a round hole, but I really would like to keep as few protocols in play at a time.

How do I let the interested nodes know that there is new data?
You can use BEP46 to notify clients of the most recent version of a torrent.
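A BEP 46 pointer is just a BEP 44 mutable item whose value holds the latest infohash, and updates are ordered by the BEP 44 sequence number: a client should only replace its current value when it sees a strictly higher seq. A small sketch of that acceptance rule:

```python
class MutablePointer:
    """Tracks the latest version of a BEP 46-style mutable pointer.
    Per BEP 44, an update is accepted only when its sequence number
    is strictly greater than the one already held."""

    def __init__(self):
        self.seq = -1          # nothing seen yet
        self.infohash = None

    def update(self, seq: int, infohash: bytes) -> bool:
        # Reject stale or replayed updates; accept strictly newer ones.
        if seq > self.seq:
            self.seq, self.infohash = seq, infohash
            return True
        return False
```

Pollers can then check the pointer at whatever interval suits the 30-minute-to-6-hour update cadence, and only fetch a new torrent when `update()` returns True.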
If I have a thousand or more nodes adding new infohashes every few minutes what will that do to the DHT?
It's hard to give a general answer here. Is each node adding a distinct dataset, or are those thousands of nodes going to participate in the same pooled data and thus more or less share one infohash? The latter should be fairly efficient, since not all of them even need to announce themselves: they could do a read-only lookup, try to connect to the swarm, and only announce when there are not enough reachable peers. This would be similar to the put optimisation for mutable items.
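The announce-only-when-needed idea boils down to a simple decision after the read-only lookup. A sketch (the threshold of 8 is an assumption to tune, not something specified by a BEP):

```python
def needs_announce(reachable_peers: int, target: int = 8) -> bool:
    """After a read-only get_peers lookup and connection attempts,
    announce only if too few peers were actually reachable.
    Staying passive otherwise keeps write traffic on the DHT low."""
    return reachable_peers < target
```

With thousands of nodes sharing one infohash, most lookups will find plenty of reachable peers, so only a small fraction of nodes ever write to the DHT.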

Related

How can I increase the number of peers in my routing table associated with a given infohash

I'm working on a side project and trying to monitor peers on popular torrents, but I can't see how I can get a hold of the full dataset.
If the theoretical limit on routing table size is 1,280 (from 160 buckets × a bucket size of k = 8), then I'm never going to be able to hold the full number of peers on a popular torrent (~9,000 on a current top-100 torrent).
My concern with simulating multiple nodes is low efficiency due to overlapping values. I would assume that their bootstrapping paths being similar would result in similar routing tables.
Your approach is wrong: it would violate the reliability goals of the DHT. You would essentially be performing an attack on that keyspace region, other nodes may detect and blacklist you, and it would also simply be bad-mannered.
If you want to monitor specific swarms, don't collect data passively from the DHT. Instead:
if the torrents have trackers, just contact them to get peer lists
connect to the swarm and get peer lists via PEX, which provides far more accurate information than the DHT
if you really want to use the DHT, perform active lookups (get_peers) at regular intervals
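The active-lookup option can be sketched as a simple polling loop. The actual `get_peers` call depends on your DHT client, so it is injected here as a callable (`lookup`); the interval and round count are illustrative:

```python
import time

def poll_swarm(lookup, interval_s: float, rounds: int) -> set:
    """Actively sample a swarm by running a get_peers-style lookup
    every `interval_s` seconds, accumulating the peers seen.
    `lookup` is whatever your DHT client exposes for a lookup;
    it should return an iterable of peer endpoints."""
    seen = set()
    for i in range(rounds):
        seen.update(lookup())
        if i + 1 < rounds:      # no pointless sleep after the last round
            time.sleep(interval_s)
    return seen
```

Repeated sampling matters because any single lookup only reaches the handful of nodes closest to the infohash, each of which holds a partial, churning view of the swarm.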

Single row hotspot

I built a Twitter clone, and the row that stores Justin Bieber’s profile (some very famous person with a lot of followers) is read incredibly often. The server that stores it seems to be overloaded. Can I buy a bigger server just for that row? By the way, it isn’t updated very often.
The short answer is that Cloud Spanner does not offer different server configurations, except to increase your number of nodes.
If you don't mind reading stale data, one way to increase read throughput is to use read-only, bounded-staleness transactions. This will ensure that your reads for these rows can be served from any replica of the split(s) that owns those rows.
If you wanted to go even further, you might consider a data modeling tradeoff that makes writes more expensive but reads cheaper. One way of doing that would be to manually shard that row (for example by creating N copies of it with different primary keys). When you want to read the row, a client can pick one to read at random. When you update it, just update all the copies atomically within a single transaction. Note that this approach is rarely used in practice, as very few workloads truly have the characteristics you are describing.
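The manual-sharding tradeoff above can be sketched as follows. A plain dict stands in for the table, and `N_COPIES` and the key format are illustrative choices, not Spanner API:

```python
import random

N_COPIES = 4  # assumption: tune to your read load

def copy_key(base_key: str, i: int) -> str:
    # Each copy gets its own primary key, e.g. "bieber#0" .. "bieber#3",
    # so the copies land on different splits.
    return f"{base_key}#{i}"

def read_profile(db: dict, base_key: str) -> dict:
    # Spread reads across the copies so no single split stays hot.
    return db[copy_key(base_key, random.randrange(N_COPIES))]

def write_profile(db: dict, base_key: str, profile: dict) -> None:
    # In Spanner, this loop would run inside a single read-write
    # transaction so all N copies are updated atomically.
    for i in range(N_COPIES):
        db[copy_key(base_key, i)] = dict(profile)
```

Reads get roughly N times the throughput; writes cost N times as much, which is acceptable here precisely because the row "isn't updated very often."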

Move tokens for node one by one or simultaneously?

When adding a new node, should I really perform nodetool move [token] one node at a time? Is it good practice to move tokens for more than one node simultaneously? For example, could I process multiple nodes from arbitrary, unrelated replica sets that way?
The DataStax docs say "no". I think I can, but I'm not sure.
When you move a token, you also move the data associated with that token. That could be a lot of data, and it all has to travel over your network. There is a danger of saturating your network links if you move too much data simultaneously. So you can do it, but perhaps you ought not to.
A rather old piece of DataStax documentation says:
[...] and then run nodetool move, one node at a time.
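The one-node-at-a-time advice amounts to simply serializing the moves. A sketch (assumes `nodetool` is on PATH; the runner is injectable so the logic can be exercised without a cluster):

```python
import subprocess

def move_tokens(moves, run=None):
    """Run `nodetool move` strictly one node at a time, as the
    DataStax docs advise. `moves` is a list of (host, new_token)
    pairs; `run` can be injected for testing."""
    if run is None:
        def run(host, token):
            # check_call blocks until the command exits, so each
            # move's streaming finishes before the next one starts.
            subprocess.check_call(
                ["nodetool", "-h", host, "move", str(token)])
    for host, token in moves:
        run(host, token)
```

Serializing the moves bounds how much streaming traffic is in flight at once, which is exactly the network-saturation concern described above.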

Can CouchDB handle thousands of separate databases?

Can CouchDB handle thousands of separate databases on the same machine?
Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions; just think of a very large number of very small, frequently updated records. It's basically a join table from SQL-land.)
Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.
This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).
Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?
(Thanks!)
[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]
The short answer is "yes".
The longer answer is that there are some things you need to watch out for...
You're going to be playing whack-a-mole with a lot of system settings like max file descriptors.
You'll also be playing whack-a-mole with erlang vm settings.
CouchDB has a "max open databases" option. Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers, they're all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won't (e.g., each server will still need to do its own view builds).
By contrast, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of its Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.
Cheers.
I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation.
So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
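In a CouchDB 3.x partitioned database, the partition is encoded directly in the document `_id`: everything before the first `:` is the partition key. A sketch of that ID convention (the branch/transaction names are illustrative):

```python
def partitioned_id(branch: str, txn_id: str) -> str:
    """Build a CouchDB 3.x partitioned-database _id: the partition
    key (here, the branch) comes before the first ':'."""
    return f"{branch}:{txn_id}"

def partition_of(doc_id: str) -> str:
    """Recover the partition key from a partitioned _id."""
    return doc_id.split(":", 1)[0]
```

A query against `/db/_partition/branch42/...` then only touches the shards holding that partition, which is what keeps per-branch views cheap.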
Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that the cost is only paid when a document is folded into the view; each document will only be parsed once per view.
For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate databases. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate dbs.
If this latency matters, and assuming CouchDB has some spare clock cycles, you could write a 3-line loop/view/sleep shell script that updates the views before any user is waiting.
I would add that having a large number of databases creates issues around compaction and replication. Not only do things like continuous replication need to be triggered on a per-database basis (meaning you will have to write custom logic to loop over all the databases), but they also spawn replication daemons per database. This can quickly become prohibitive.

How to make Riak data localized?

I'm designing a Riak cluster at the moment and wondering if it is possible to hint Riak that a specific bunch of keys should be placed on a single node of the cluster?
For example, there is some private data for the user, that only she is able to access. This data contains ~10k documents (too large to be kept in one key/document), and to serve one page, we need to retrieve ~100 of them. It would be better to keep the whole bunch on a single node + have the application on the same instance to make this faster.
AFAIK this is easy in Cassandra: just use OrderedPartitioner and keys like <hash(username)>/<private data key>. That way, almost all of a user's keys will be kept on a single node.
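The key scheme described in the question can be sketched as follows. The choice of MD5 for `hash(username)` is an assumption for illustration; any stable hash gives the same locality effect:

```python
import hashlib

def private_data_key(username: str, data_key: str) -> str:
    """Build a <hash(username)>/<private data key> row key.
    With an order-preserving partitioner, all keys sharing the
    same hash prefix sort together, so one user's documents tend
    to land on the same node."""
    prefix = hashlib.md5(username.encode("utf-8")).hexdigest()
    return f"{prefix}/{data_key}"
```

The tradeoff, as the answer below notes, is that deliberately clustering one user's keys concentrates load and reduces the even distribution the ring is designed for.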
One of the points of using Riak is that your data is replicated and evenly distributed throughout the cluster, thus improving your tolerance for network partitions and outages. Placing data on specific nodes goes against that goal and increases your vulnerability.