I'm trying to implement a DHT based on the Kademlia paper, as a way of better understanding how these systems work.
I have read some other articles that refer to this way of implementing a distributed hash table, but there is something I can't wrap my head around.
In a p2p file-exchange network, key IDs could be implemented as digests of the filenames, for consistency throughout the search mechanism.
But what about the node ID itself?
Should I, for example, take the digest of the "WAN-IP:PORT" combination, or simply generate a completely random ID from scratch?
In the second scenario, there is always a risk that two nodes generate the same ID. But by using my WAN-IP:PORT, I rely on the fact that my client is running a node behind a WAN IP that should never change.
Or should it come from the network itself? I mean, on first contact, should the peer work some magic and assign the new node an ID?
I would appreciate some input on how to implement Kademlia node ID generation.
As stated in the Kademlia paper, "Node IDs are currently just random 160-bit identifiers, though they could equally well be constructed as in Chord".
Chord uses the SHA-1 digest of the node's IP address.
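For illustration, here is a minimal Node.js sketch of both approaches; the function names and the example endpoint are my own, not from the paper:

    const crypto = require('crypto');

    // Option 1: a purely random 160-bit ID, as in the Kademlia paper.
    function randomNodeId() {
      return crypto.randomBytes(20); // 20 bytes = 160 bits
    }

    // Option 2: derive the ID from the node's endpoint, as in Chord.
    function derivedNodeId(wanIp, port) {
      return crypto.createHash('sha1').update(`${wanIp}:${port}`).digest();
    }

    console.log(randomNodeId().toString('hex'));
    console.log(derivedNodeId('203.0.113.7', 4000).toString('hex'));

With 160-bit random IDs, the probability of any collision among n nodes is roughly n^2 / 2^161 (birthday bound), which is negligible for any realistic network size.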
Okay, I've been reading articles and the Kademlia paper recently in order to implement a simple p2p program that uses the Kademlia DHT algorithm. Those papers say that the 160-bit keys in a Kademlia node are used to identify both the nodes (node ID) and the data (which is stored in the form of tuples).
I'm quite confused about that 'both' part.
As far as my understanding goes, each node in a Kademlia binary tree uniquely represents a client (IP, port), each of which holds a list of files.
Here is the general flow as I understand it:
1. Client (.exe) gets booted.
2. Client creates a node component.
3. The newly created node joins the network (bootstrapping).
4. The node sends find_node(filehash) to the k closest nodes. (Say the hash is generated by hashing the binary of a file named file1.txt.)
5. Each node that receives the query looks up the queried filehash in its own hash table (say, a hash map holding (file hash, file location) entries).
6. Steps 4 and 5 are repeated until the node is found (meanwhile, all the nodes involved update their buckets).
Does this flow look all right?
Additionally, Kademlia's bootstrapping method confuses me too.
When the node gets created (when the user executes the program), it seems to use a bootstrap node to fill up its buckets. But then what's a bootstrap node? Is it another process that's always running? What if the bootstrap node gets turned off?
Can someone help me better understand the concept?
Thanks for the help in advance.
Does this flow look all right?
It seems roughly correct, but your wording is not very precise.
Each node has a routing table in which it organizes the neighbors it knows about, and another table in which it organizes the data it is asked to store by others. Nodes have a quasi-random ID that determines their position in the routing keyspace. The hashes of keys for stored data don't precisely match any particular node ID, so the data is stored on the nodes whose IDs are closest to the hash, as determined by the distance metric. That is how node IDs and key hashes are used for both purposes.
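To make the distance metric concrete, here is a small Node.js sketch; the kClosest helper and the node shape are my own illustration, not from the paper:

    // XOR distance between two equal-length 160-bit IDs held as Buffers.
    function xorDistance(a, b) {
      const d = Buffer.alloc(a.length);
      for (let i = 0; i < a.length; i++) d[i] = a[i] ^ b[i];
      return d;
    }

    // The k entries of `nodes` whose IDs are XOR-closest to `target`.
    // Comparing the distance buffers lexicographically is the same as
    // comparing them as big-endian 160-bit integers.
    function kClosest(nodes, target, k = 20) {
      return [...nodes]
        .sort((m, n) => Buffer.compare(xorDistance(m.id, target),
                                       xorDistance(n.id, target)))
        .slice(0, k);
    }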
When you perform a lookup for data (i.e. find_value) you ask the remote nodes for the k-closest neighbor set they have in their routing table, which will allow you to home in on the k-closest set for a particular target key. The same query also asks the remote node to return any data they have matching that target ID.
When you perform a find_node on the other hand you're only asking them for the closest neighbors but not for data. This is primarily used for routing table maintenance where you're not looking for any data.
Those are the abstract operations. If needed, an actual implementation could separate the lookup from the data retrieval, i.e. first perform a find_node and then use the result set to perform one or more separate get operations that don't involve additional neighbor lookups (similar to the store operation).
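A very rough sketch of such an iterative lookup might look like this, building on the kClosest helper above; sendFindNode stands in for an assumed UDP RPC helper and is not a real library call:

    // Query closer and closer nodes until the k-closest set for
    // `target` stops improving. (The paper queries alpha nodes in
    // parallel; this sketch runs sequentially for simplicity.)
    async function iterativeFindNode(target, bootstrapNodes, k = 20) {
      const queried = new Set();
      let shortlist = kClosest(bootstrapNodes, target, k);
      let improved = true;
      while (improved) {
        improved = false;
        // Iterates a snapshot of the shortlist; the while loop
        // re-checks the updated one.
        for (const node of shortlist) {
          const hex = node.id.toString('hex');
          if (queried.has(hex)) continue;
          queried.add(hex);
          const neighbors = await sendFindNode(node, target); // assumed RPC
          const merged = kClosest(shortlist.concat(neighbors), target, k);
          if (merged.some(n => !shortlist.includes(n))) improved = true;
          shortlist = merged;
        }
      }
      return shortlist;
    }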
Since Kademlia is UDP-based, you can't really serve arbitrary files, because those could easily exceed reasonable UDP packet sizes. So in practice Kademlia usually just serves as a hash table for small binary values (e.g. contact information, public keys and such). Bulk operations are either performed by other protocols bootstrapped off those values or by additional operations beyond those mentioned in the Kademlia paper.
What the paper describes is only the basic functionality of the routing algorithm and the most basic key-value storage. It is a spherical cow in a vacuum. Actual implementations usually need additional features or workarounds for the security and reliability problems faced on the public internet.
But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?
That's covered in this question (using the BitTorrent DHT as an example).
I am studying back-end programming, specifically with Node.js and ExpressJS, and it currently baffles me how the "keys" prop of the cookie-session library helps us. What is the point of it? I have been reading a lot of different material related to authentication, sessions, etc., but the answer to this particular question remains ambiguous to me.
Could someone give me an in-depth explanation of this topic, preferably both ways: in simple terms and in programming lexicon?
To explain it in simple terms:
It essentially means using different keys (rotating the keys) every so often to sign the session data, so that the damage from any one key being breached is contained/limited. Put another way: if a key can be cracked in x months, then rotating to a different key every x-1 months reduces the probability of the data being compromised.
This question actually belongs on Crypto Stack Exchange, is kind of hard to describe, and is out of scope for the docs. Searching also doesn't return accurate results unless you search specifically for methods/algorithms of key rotation.
Visit these to get a conceptual overview and in-depth examples:
What's the purpose of key-rotation?
(recommended)
Key Rotation for Authenticated Encryption
And these for more in-depth technical and mathematical reference:
Fully Key-Homomorphic Encryption, Arithmetic Circuit ABE and Compact Garbled Circuits
Fast and Secure Updatable Encryption
Whatever values (inside the array) are provided for the keys prop are used to sign and verify the user.id / session.id that we store in the browser's cookie.
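As a concrete sketch of an Express setup (the key strings and route are placeholders of my own): cookie-session signs newly written cookies with the first key in the list and tries every key when verifying incoming ones, so to rotate you prepend a fresh key and drop the old one once existing sessions have expired.

    const express = require('express');
    const cookieSession = require('cookie-session');

    const app = express();

    app.use(cookieSession({
      name: 'session',
      // First key signs new cookies; all keys are used for verification,
      // so sessions signed with the previous key remain valid.
      keys: ['current-secret-key', 'previous-secret-key'],
      maxAge: 24 * 60 * 60 * 1000 // 24 hours
    }));

    app.get('/', (req, res) => {
      req.session.views = (req.session.views || 0) + 1;
      res.send(`views: ${req.session.views}`);
    });

    app.listen(3000);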
I'm trying to understand how Kademlia works with regard to finding a resource. There are pretty good descriptions of how to build the node tree closest to the self node, how to find the distance between nodes, how to initiate the process, etc. What I don't understand is how the file infohash fits into this picture. All the descriptions tell us how to get into the game and build our own part of the distributed hash table, but that is not the end goal. We are doing this to actually find a resource: a file with a certain infohash. How is it stored in this node tree, or is there a separate one? How does finding the nodes that have this infohash, and consequently the file, work?
There is brief mention of the fact that node IDs and infohashes are both 20-byte codes, and that node ID XOR infohash is the distance between the node and the resource, but I cannot picture how that works or how it helps to find the resource. After all, a node that actually has the resource could have the greatest XOR distance to it.
Thank you,
Alex
I recommend that you not only read the BitTorrent DHT specification but also the original Kademlia paper, since the former is fairly concise and only mentions some things in passing.
BitTorrent's get_peers lookup is equivalent to the find_value operation described in the paper.
In short: just as you can do an iterative lookup to find the K-closest-node set for your own node's ID (closest based on XOR distance relative to the target key), you can do so for any other ID.
For get_peers you simply use the infohash as target key.
The K-closest-node set for a particular infohash is the set of nodes considered responsible for storing the data for said infohash, although due to implementation inaccuracies and node churn, more than K nodes around the target key may be storing data of interest.
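To put that in code terms (Node.js; iterativeLookup is an assumed helper along the lines of the iterative lookup sketched earlier on this page, not a real API):

    const crypto = require('crypto');

    // A BitTorrent infohash is the SHA-1 of the torrent's info
    // dictionary; placeholder bytes stand in for that dictionary here.
    const infohash = crypto.createHash('sha1')
      .update(Buffer.from('placeholder info dictionary'))
      .digest();

    // get_peers is the normal iterative lookup pointed at the infohash
    // instead of a node ID, additionally asking each visited node for
    // any stored peers:
    // const peers = await iterativeLookup(infohash, { wantValues: true });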
Using Node.js and crypto, right now, when a user logs in, I generate a random auth token:
var token = crypto.randomBytes(16).toString('hex');
I know it's unlikely, but there is a tiny chance for two tokens to be of the same value.
This means a user could, in theory, authenticate on another account.
Now, I see two obvious methods to get past this:

1. When I generate the token, query the user database and see if a token with the same value already exists. If it does, just generate another one. As you can see, this is not perfect, since I am adding queries to the database.
2. Since every user has a unique username in my database, I could generate a random token using the username as a secret generator key. This way, there is no way two tokens can have the same value. Can crypto do that? Is it secure?
How would you do it?
It's too unlikely to worry about it happening by chance. I would not sacrifice performance to lock and check the database for it.
Consider this excerpt from Pro Git about the chance of random collisions between 20-byte SHA-1 sums:
Here’s an example to give you an idea of what it would take to get a SHA-1 collision [by chance]. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists [for average projects] that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
(SHA-1 collisions can be directly constructed now, so the quote is now less applicable to SHA-1, but it's still valid when considering collisions of random values.)
If you are still worried about that probability, then you can easily use more random bytes instead of 16.
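For example, doubling the size gives a 256-bit token, at which point a chance collision is effectively impossible:

    var token = crypto.randomBytes(32).toString('hex'); // 256 bits instead of 128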
But regarding your second idea: if you hashed the random ID with the username, then that hash could collide, just like the random ID could. You haven't solved anything.
You should always add a UNIQUE constraint to your database column. This will create an implicit index that improves searches on this column, and it will make sure that no two records ever have the same value. That way, in the worst-case scenario you get a database exception, not a security violation.
Also, depending on how frequently unique tokens need to be created, I think it's perfectly fine in most cases to use database lookups during generation. If your column, again, is properly indexed, it will be a pretty fast query. Most databases scale horizontally very well, so even if you are building the next Facebook it remains an option. Furthermore, you will probably need to do a query to check for e-mail uniqueness anyway.
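For example, a hedged sketch using node-postgres (the table and column names are my own invention): insert blindly and retry only on a unique violation, so the happy path costs a single query.

    const crypto = require('crypto');
    const { Pool } = require('pg');

    const pool = new Pool();

    async function createToken(userId) {
      for (;;) {
        const token = crypto.randomBytes(16).toString('hex');
        try {
          await pool.query(
            'INSERT INTO auth_tokens (user_id, token) VALUES ($1, $2)',
            [userId, token]
          );
          return token;
        } catch (err) {
          // 23505 is PostgreSQL's unique_violation. Astronomically
          // unlikely here, but if it fires we just try a fresh token.
          if (err.code !== '23505') throw err;
        }
      }
    }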
Finally, if you are really concerned about performance, you could always pre-generate a million unique tokens and store them in a separate database table for quick use. Just set up a routine to periodically check its usage and insert more records as needed. However, as #MacroMan stated in the comments, this could have security implications if someone gains access to the list of pre-generated tokens, so this practice should be avoided.
PostgreSQL UNIQUE CONSTRAINT
MySQL: Unique Constraints
I know how data is (in theory) stored in a DHT. However, I am uncertain as to how one might go about updating a piece of data associated with a key. Is this possible? Also, how are conflicts handled in a DHT?
A DHT simply defines put(key, value) and get(key) operations, and the core of the various DHT algorithms revolves around how to locate the nodes responsible for a specific key.
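Sketched as an interface (Node.js; the closestNodes, sendStore, and sendGet helpers are deliberately left abstract here, since how to find the responsible nodes is exactly what each algorithm defines differently):

    class Dht {
      // Store the value on the k nodes closest to the key.
      async put(key, value) {
        const nodes = await this.closestNodes(key); // algorithm-specific
        await Promise.all(nodes.map(n => this.sendStore(n, key, value)));
      }

      // Locate the same nodes and ask them for the value.
      async get(key) {
        const nodes = await this.closestNodes(key);
        for (const n of nodes) {
          const value = await this.sendGet(n, key); // assumed RPC helper
          if (value !== undefined) return value;
        }
      }
    }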
What those nodes do on an incoming put request for a value already stored largely depends on the purpose and implementation of the DHT network, not on the algorithm itself.
E.g. a node might opt to timestamp all incoming values and return lists with multiple separate timestamped entries. Or it might return lists that also include the source address for each value. Or it might just overwrite the stored value.
If you have some relation between the key and a signature within the value or the source ID or something like that you can put enough intelligence into the nodes to verify the data cryptographically and thus allow them to keep a single canonical value for each key by replacing the old data.
In the case of BitTorrent's DHT you wouldn't want that. Many different BitTorrent peers announce their presence under a single key from different source addresses. Therefore the nodes actually store unique <key, IP, port> tuples, where <IP, port> can be considered the value. This means a lookup returns lists of IPs and ports. And since a DHT has multiple nodes responsible for one key, you will actually have K (the bucket size) nodes responding with varying lists.
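A hedged sketch of what that storage might look like on a single node (the names are my own):

    // Map from hex infohash to a Set of "IP:port" strings; the Set
    // makes the stored <key, IP, port> tuples unique automatically.
    const peersByInfohash = new Map();

    function announcePeer(infohashHex, ip, port) {
      if (!peersByInfohash.has(infohashHex)) {
        peersByInfohash.set(infohashHex, new Set());
      }
      peersByInfohash.get(infohashHex).add(`${ip}:${port}`);
    }

    function getPeers(infohashHex) {
      return [...(peersByInfohash.get(infohashHex) || [])];
    }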
TL;DR: It's implementation-dependent
It is possible. I've researched Pastry's DHT. It is possible to alter data stored under a given key, but Pastry's developers advise against it, as it can have nasty side effects, mainly with the replicas of the altered piece of data that are stored on other nodes (see the FAQ on FreePastry's home page).
I'm not sure how it would affect other DHTs such as Chord or Tapestry, however.
With regard to conflicts, again I only have experience with Pastry. If you try to store data under a key that's already in use, an exception will be thrown.