Where would a Distributed Hash Table be used in place of BitTorrent?

The relative lack of recent research into DHTs, compared with BitTorrent (Tribler being the most significant research project), has led me to wonder about how DHTs are actually used.
Both BitTorrent and Distributed Hash Tables provide a method for distributing content among peers using a key-value-style datastore. What are the use cases where a DHT would be more applicable than BitTorrent?

BitTorrent and most file sharing applications are built on unstructured peer-to-peer overlay networks.
A DHT is a structured peer-to-peer network overlay.
Structured and unstructured peer-to-peer networks differ mainly in their routing algorithms. Unstructured P2P networks rely on flooding or heuristic searches, so a search is not guaranteed to find the file it is looking for.
With a DHT (a structured P2P network), by contrast, a file stored under a given key is guaranteed to be found on request, barring a network error or some other abnormality. (I've done a lot of performance testing with FreePastry, and it's extremely reliable.)
A DHT is therefore more suitable for applications where a file stored in the P2P network MUST be found. With BitTorrent, I would guess it's not essential that every file be found by every request.
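To make that lookup guarantee concrete, here is a minimal toy sketch of the idea behind Kademlia-style routing (a simplified single-process model, not FreePastry or any real DHT): keys and node IDs live in one identifier space, and a deterministic distance metric routes every lookup to the same node the value was stored on.

```python
import hashlib

def sha1_id(name: str) -> int:
    """Map a name to a 160-bit ID, as Kademlia does with SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: the XOR of two IDs."""
    return a ^ b

# A toy network: each node is just a local dict keyed by its node ID.
nodes = {sha1_id(f"node-{i}"): {} for i in range(8)}

def closest_node(key: int) -> int:
    """Deterministic routing: the node whose ID minimizes XOR distance."""
    return min(nodes, key=lambda n: xor_distance(n, key))

def put(key_name: str, value: bytes) -> None:
    key = sha1_id(key_name)
    nodes[closest_node(key)][key] = value

def get(key_name: str) -> bytes:
    key = sha1_id(key_name)
    # Routing is deterministic, so the lookup lands on exactly the node
    # the value was stored on -- this is the structured network's guarantee.
    return nodes[closest_node(key)][key]

put("report.pdf", b"file contents")
assert get("report.pdf") == b"file contents"
```

An unstructured network has no such metric: a query just floods outward and may expire before reaching the one node that holds the file.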

Related

Implementing peer discovery in libp2p

Is peer discovery in libp2p (e.g. peers telling each other about peers they know about, and managing lists of connected nodes) in Rust controlled entirely at the level of a NetworkBehaviour?
It looks like one option is Kademlia, which (in the Rust implementation) appears to do this by defining a NetworkBehaviour.
Is it correct that if you don't want to use Kademlia to implement peer discovery, you do so by defining peer discovery as part of your own NetworkBehaviour?
I'm trying to avoid a situation whereby I start to implement code to do this, but then I find that libp2p is actually doing this for me under the covers.
You have several alternatives, but of course you have to implement a behavior (or combination of behaviors) to discover peers:
mDNS
It allows peers to discover each other when they are on the same local network without any configuration. It is obviously the simplest discovery mode, but limited to local networks.
There is an example of this in the rust-libp2p repository.
Rendezvous
Its goal is to provide a lightweight mechanism for generalized peer discovery. As the name indicates, it requires nodes that act as rendezvous points. The protocol's implementation examples show this best.
Kademlia
This is the best option in the context of a network with many nodes, where a portion of these nodes may offer limited connectivity. It's simpler than it seems, but at the time we did not find practical examples, and we learned through trial and error.
Some of my colleagues are preparing a tutorial series to be published soon, to share our experience with libp2p in Rust.
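As a language-neutral illustration of what the Rendezvous behaviour takes care of for you (a toy in-process model with made-up names, not libp2p's actual API): peers register themselves under a namespace at a well-known rendezvous node, and other peers discover them by querying that node.

```python
# Toy model of the rendezvous pattern (illustrative only, not libp2p's API).
class RendezvousPoint:
    def __init__(self) -> None:
        self.registrations: dict[str, set[str]] = {}  # namespace -> peer addresses

    def register(self, namespace: str, peer_addr: str) -> None:
        """A peer announces itself under a namespace."""
        self.registrations.setdefault(namespace, set()).add(peer_addr)

    def discover(self, namespace: str) -> set[str]:
        """Any peer can ask who is registered under a namespace."""
        return self.registrations.get(namespace, set())

rendezvous = RendezvousPoint()
rendezvous.register("my-app", "/ip4/10.0.0.1/tcp/4001")
rendezvous.register("my-app", "/ip4/10.0.0.2/tcp/4001")
# A freshly started peer discovers others without knowing them in advance:
print(rendezvous.discover("my-app"))
```

In libp2p the same roles exist, but registration and discovery happen over the network as part of your composed NetworkBehaviour.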

Hyperledger-fabric use cases

I'm currently looking to securely replicate hundreds of Gbs of data across a few hundred hosts. I was looking at hyperledger-fabric private blockchain because of its use of TLS and peer to peer gossip protocol for data transmission, plus of course the security of the blockchain itself.
Is it reasonable for me to consider using a blockchain as a way to securely replicate data? I have not seen this in any blockchain use case, but from what I've read it seems reasonable, even though everything I've read also indicates that storing data in the blockchain is a bad idea. The usual arguments are that it costs too much and that the data has to be replicated across all the peers in the system. Cost isn't a concern here because it's a private blockchain, and for my use case the replication (if it can be done efficiently) is exactly what I'm looking for.
I could use ipfs, swift, S3, etc. to store the data, but that would add operational burden, especially if hyperledger-fabric can do the job on its own.
Also, if I use hyperledger private data collections, how much control over purging do I have? For my use cases, I can't just purge the oldest data as in some cases older data needs to be preserved for a long time and in some cases newer data can be purged fairly quickly.
On the subject of data replication:
TL;DR: Not a blockchain solution.
Here's my thinking behind that.
Storing large amounts of data isn't a good idea, as you've mentioned. Yes, the data does get replicated across peers (and in this case that side effect is exactly what's needed), but there's also the signing, validation, etc. that needs to take place across all of that data. The processing costs would make it inefficient.
Definition of 'securely': you don't say what qualities would constitute 'secure'. For example:
Access control, so that only authorized users can access the data?
Assurance that the data has been replicated and is on disk at remote locations without corruption?
Encryption of the data to protect it in transit and at rest?
Blockchain, and I'm thinking of Hyperledger Fabric here, would offer you the assurance. But there's no encryption in transit; you'd need to add that. And for access control, the primitives are there, but they require you to implement and use them.
I would tend to think the role of a blockchain in this scenario is to provide the audit trail of how the data was replicated between hosts, with the replication itself done by some other protocol.
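As a sketch of that division of labour (field names are made up for illustration; this is plain Python, not Fabric chaincode): the bulk transfer happens over whatever replication protocol you choose, and only a compact, hash-anchored record of each replication event would be written to the ledger.

```python
import hashlib
import json
import time

def file_digest(path: str) -> str:
    """SHA-256 of the file contents, streamed so large files stay cheap."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replication_audit_record(path: str, source: str, dest: str) -> str:
    # Only this small record goes on the ledger -- never the data itself.
    return json.dumps({
        "file_sha256": file_digest(path),
        "source_host": source,
        "dest_host": dest,
        "replicated_at": int(time.time()),
    })
```

Anyone auditing the ledger can later re-hash the file at the destination and compare.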
On the subject of private data collection purging:
Currently this is implemented by purging data once the peer reaches a certain block height, e.g. purge after 42 blocks. But we're working on a feature to allow 'purge on demand' via a call from the chaincode.
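For reference, the block-height purge described above corresponds to the blockToLive field in a collection definition. A minimal sketch of one entry in a collections config (values are illustrative):

```json
{
  "name": "replicatedDataCollection",
  "policy": "OR('Org1MSP.member', 'Org2MSP.member')",
  "requiredPeerCount": 1,
  "maxPeerCount": 2,
  "blockToLive": 42,
  "memberOnlyRead": true
}
```

blockToLive: 42 means the private data is purged 42 blocks after it was written, while 0 keeps it indefinitely, which is why age-based retention is currently all-or-nothing per collection.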

Hyperledger Fabric private data collection to distribute large files

We are currently researching Hyperledger Fabric, and from the documentation we know that a private data collection can be set up among a subset of organizations. There is a private state DB (a.k.a. side DB) on each of these organizations, and as I understand it, the side DB is just like a normal state DB, typically backed by CouchDB.
One of our main requirements is that we have to distribute files (e.g. PDFs) among a subset of the peers. Each file has to be disseminated to and stored at the related peers, so centralized storage like AWS S3 or other cloud/server storage is not acceptable. As the files may be large, the physical copies must be stored and disseminated off-chain; the transaction block would only store the hashes of these documents.
My idea is that we could make use of a private data collection and the side DB. The physical files could be stored in the side DB (perhaps as base64 strings?) and distributed via the gossip protocol (a P2P protocol), which is a built-in feature of Hyperledger Fabric. The hash of the document, along with other transaction details, would be stored in a block as usual. As these are all native Hyperledger Fabric features, I expect the transfer of the files via the gossip protocol and the creation of the corresponding block to be in sync.
My question is:
Is this way feasible to achieve the requirement? (Distribution of the files to different peers while creating a new block) I kinda feel like it is hacky.
Is this a good way / practice to achieve what we want? I have been doing research but I cannot find any implementation similar to this.
Most of the tutorials I found online presuppose that the files can be stored in a single centralized store such as a cloud service or some sort of server, while our requirement demands distribution of the files as well. Is the idea described above acceptable and feasible? We are very new to blockchain, and any advice is appreciated!
Is this way feasible to achieve the requirement? (Distribution of the files to different peers while creating a new block) I kinda feel like it is hacky.
The workflow of private data distribution is that the orderer bundles the private data transaction, which contains only a hash to verify the data, into a new block. So you don't need a workaround here: private data provides this by default. The data itself gets distributed between authorized peers via the gossip data dissemination protocol.
Is this a good way / practice to achieve what we want? I have been doing research but I cannot find any implementation similar to this.
Yes and no, sorry to say. It depends on your file sizes and their number. Fabric is capable of really high throughput; I would test things out and see whether it meets your requirements.
The other approach would be to work around it and use IPFS (a P2P file system). You can read more about that approach here.
And here is an article discussing storing 'larger files' on chain. Maybe this gives some constructive insights as well, but keep in mind it is an older article.
Check out IBM Blockchain Document Store; it is an implementation of storing any document (PDF or otherwise) both on and off chain. It has been done.
And while the implementation isn't publicly available, there is extensive documentation on its usage, from which you can probably glean some useful information.
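To make the pattern from the question concrete, here is a minimal client-side sketch of the two payloads involved (names are illustrative; Fabric itself computes and stores the hash of private data, so this only shows the shape of what you would submit):

```python
import base64
import hashlib
import json

with open("contract.pdf", "rb") as f:
    pdf_bytes = f.read()

# Sent as transient data: lands only in the side DBs of authorized peers.
private_payload = {"doc": base64.b64encode(pdf_bytes).decode("ascii")}

# Goes into the public transaction / world state for everyone to verify.
public_payload = json.dumps({
    "doc_id": "contract-0001",
    "doc_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
})
```

Note that base64 inflates the payload by about a third, which matters if the PDFs are large.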

Security Program - Splitting Files

How would you go about describing the architecture of a "system" that splits a sensitive file into smaller pieces on different servers in order to protect the file?
Would we translate the file into bytes and then distribute those bytes across different servers? And how would you go about gathering all the pieces back together to reconstruct the original file (given the correct permissions)?
This is a theoretical problem that I do not know how to approach. Any hints at where I should start?
This is not an authoritative answer, but you will get many replies here that provide partial answers to your question. It may at least give you some ideas.
My guess is, you would be creating a custom file system.
Take a look at various filesystems like
GmailFS: http://richard.jones.name/google-hacks/gmail-filesystem/gmail-filesystem.html
pyfilesystem: http://code.google.com/p/pyfilesystem/
A distributed file system in python: http://pypi.python.org/pypi/koboldfs
Hence architecturally, it will be very similar to the way a typical distributed filesystem is implemented.
It should be a client/server architecture in master/slave mode. You will have to create a custom protocol for their communication.
Master process is what you will talk to for retrieving / writing your files.
Slave processes would be distributed across different servers, each keeping tagged files that contain partial pieces of a file.
The master will contain a per-file entry that locates the full sequence of tagged data distributed across the various slave servers.
You could add redundancy by storing a given piece of tagged data on multiple servers.
The communication protocol will have to allow multiple servers to respond to a request for tagged data; in the simplest case the master simply picks one response and ignores the others.
The usual security requirements need to be respected when storing this information and communicating it between servers.
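A minimal sketch of the chunk-and-manifest idea described above (a single-process model in which a dict stands in for each slave server; a real system would add encryption, authentication, and redundancy):

```python
import hashlib

CHUNK_SIZE = 4096
servers = [{} for _ in range(3)]  # each dict stands in for one slave server
manifest = {}                     # the master's per-file index

def split_and_store(name: str, data: bytes) -> None:
    entries = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        tag = hashlib.sha256(chunk).hexdigest()   # content-addressed tag
        idx = (i // CHUNK_SIZE) % len(servers)    # round-robin placement
        servers[idx][tag] = chunk
        entries.append((idx, tag))
    manifest[name] = entries  # only the master knows the full sequence

def reassemble(name: str) -> bytes:
    parts = []
    for idx, tag in manifest[name]:
        chunk = servers[idx][tag]
        assert hashlib.sha256(chunk).hexdigest() == tag  # integrity check
        parts.append(chunk)
    return b"".join(parts)

original = bytes(range(256)) * 40          # 10,240 bytes of sample data
split_and_store("secret.bin", original)
assert reassemble("secret.bin") == original
```

No single server holds the whole file or the manifest, which is exactly the protection the question is after.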
You will be most interested in a secure distributed filesystem implemented in Python: Tahoe-LAFS.
http://tahoe-lafs.org/~warner/pycon-tahoe.html
http://tahoe-lafs.org/trac/tahoe-lafs

What are the best papers for learning about algorithms for communicating updates in a distributed system?

I have a distributed system in mind (multiple nodes in a single datacenter) that I want to have the following properties:
Nodes can enter and leave the system at any time.
There is no data replication between nodes.
Which node the client makes use of is up to the client (i.e. it could be consistent hashing, it could be something else).
There is no master (i.e. no central point of failure).
Each node may receive a piece of information that needs to be forwarded to the rest of the nodes.
What algorithms (links to papers are best) are suitable for this?
(I assume some of the answers will include P2P algorithms, but most of them that I've encountered in the past have acted more like distributed hash tables, where nodes enter and take over some part of the keyspace, etc. I also recognize that multicast with simple UDP messages might be appropriate here, but what existing work would help make the messaging reliable?)
How about implementing ad hoc nodes with JXTA? See the Practical JXTA II book, available online at Scribd.
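One classic family that fits these constraints (no master, nodes joining and leaving, any node able to originate an update) is epidemic/gossip protocols; the foundational paper is Demers et al., "Epidemic Algorithms for Replicated Database Maintenance" (1987). A toy single-process simulation of push gossip:

```python
import random

def push_gossip_rounds(num_nodes: int = 50, fanout: int = 3) -> int:
    """Each round, every node that knows the update pushes it to
    `fanout` peers chosen at random. Returns the number of rounds
    until everyone has it."""
    informed = {0}  # node 0 originates the update
    rounds = 0
    while len(informed) < num_nodes:
        newly = set()
        for _ in informed:
            newly.update(random.sample(range(num_nodes), fanout))
        informed |= newly
        rounds += 1
    return rounds

print(push_gossip_rounds())  # typically O(log N) rounds to reach all nodes
```

For making multicast-style delivery reliable, the keywords to search for are anti-entropy, rumor mongering, and probabilistic reliable broadcast (e.g. the Bimodal Multicast paper by Birman et al.).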
