How to update entries in a DHT - p2p

I know how data is (in theory) stored in a DHT. However, I am uncertain as to how one might go about updating a piece of data associated with a key. Is this possible? Also, how are conflicts handled in a DHT.

A DHT simply defines put(key,value) and get(key) operations and the core of the various DHT algorithms revolve around how to locate the nodes responsible for a specific key.
What those nodes do on an incoming put request for a value already stored largely depends on the purpose and implementation of the DHT network, not on the algorithm itself.
E.g. a node might opt to timestamp all incoming values and return lists with multiple separate timestamped issues. Or it might return lists that also include the source address for each value. Or they might just overwrite the stored value.
If you have some relation between the key and a signature within the value or the source ID or something like that you can put enough intelligence into the nodes to verify the data cryptographically and thus allow them to keep a single canonical value for each key by replacing the old data.
In the case of bittorrent's DHT you wouldn't want that. Many different bittorrent peers announce their presence to a single key from different source addresses. Therefore the nodes actually store unique <key,IP,port> tuples where <IP,port> can be considered the value. Which means it'll return lists of IPs and ports on each lookup. And since a DHT will have multiple nodes responsible for one key you will actually have K (bucket size) nodes responding with varying lists.
TL;DR: It's implementation-dependent

It is possible. I've researched pastrys dht. It is possible to alter data stored under a given key but pastrys developers advise against it as it can have nasty side effects, mainly with replications of the altered piece of data which is stored on other nodes. (see the FAQ on freepastrys home page).
I'm not sure about how it would effect other dhts such as chord or tapestry however.
With regard to conflicts, again I have only experience with pastry. If you try to store data under a key that's already in use an exception will be thrown.

Related

What does it mean by Kademlia keys are used to identify nodes as well as data?

Okay, I've been reading articles and the paper about Kademlia recently to implement a simple p2p program that uses kademlia dht algorithm. And those papers are saying, those 160-bit key in a Kademlia Node is used to identify both nodes (Node ID) and the data (which are stored in a form of tuple).
I'm quite confused on that 'both' part.
As far as my understanding goes, each node in a Kademlia binary tree uniquely represents a client(IP, port) who each holds a list of files.
Here is the general flow on my understanding.
Client (.exe) gets booted
Creates a node component
Newly created node joins the network (bootstrapping)
Sends find_node(filehash) to k-closest nodes
Let's say hash is generated by hashing file binary named file1.txt
Received nodes each finds the queried filehash in its different hash table
Say, a hash map that has a list of files(File Hash, file location)
Step 4,5 repeated until the node is found (meanwhile all associated nodes are updating the buckets)
Does this flow look all right?
Additionally, bootstrapping method of Kademlia too confuses me.
When the node gets created (user executes the program), It seems like it uses bootstrapping node to fill up the buckets. But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?
Can someone help me better understand the concept?
Thanks for the help in advance.
Does this flow look all right?
It seems roughly correct, but your wording is not very precise.
Each node has a routing table by which it organizes the neighbors it knows about and another table in which it organizes the data it is asked to store by others. Nodes have a quasi-random ID that determines their position in the routing keyspace. The hashes of keys for stored data don't precisely match any particular node ID, so the data is stored on the nodes whose ID is closest to the hash, as determined by the distance metric. That's how node IDs and key hashes are used for both.
When you perform a lookup for data (i.e. find_value) you ask the remote nodes for the k-closest neighbor set they have in their routing table, which will allow you to home in on the k-closest set for a particular target key. The same query also asks the remote node to return any data they have matching that target ID.
When you perform a find_node on the other hand you're only asking them for the closest neighbors but not for data. This is primarily used for routing table maintenance where you're not looking for any data.
Those are the abstract operations, if needed an actual implementation could separate the lookup from the data retrieval, i.e. first perform a find_node and then use the result set to perform one or more separate get operations that don't involve additional neighbor lookups (similar to the store operation).
Since kademlia is UDP-based you can't really serve arbitrary files because those could easily exceed reasonable UDP packet sizes. So in practice kademlia usually just serves as a hash table for small binary values (e.g. contact information, public keys and such). Bulk operations are either performed by other protocols bootstrapped off those values or by additional operations beyond those mentioned in the kademlia paper.
What the paper describes is only the basic functionality for a routing algorithm and most basic key value storage. It is a spherical cow in a vacuum. Actual implementations usually need additional features or work around security and reliability problems faced on the public internet.
But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?
That's covered in this question (by example of the bittorrent DHT)

Are dummy partition keys always bad?

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:
dummy | id | name
-------------------------
0 | 01 | 'Oliver'
0 | 02 | 'James'
0 | 03 | 'Nicholls'
The two main points in regards to why you should avoid dummy partition keys are:
1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.
2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.
I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.
This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.
So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.
I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.
At this point I'm running out of options. I need a field in my table that will be consistent for all clients, however this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit as to avoid hot-spots and unbounded row growth.
I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:
Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.
I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.
I see two potential problems to solve here. The first:
I need a field in my table that will be consistent for all clients, however this field doesn't exist
Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.
CREATE TABLE sites_by_id (
id BIGINT,
name TEXT,
PRIMARY KEY (id));
The other problem is this query pattern:
all sites are requested
Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.
The solution here is to perform a re-examination of your data model, and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated for this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.
or all sites are requested
So basically you have a full table scan scenario. Isn't Apache Spark over Cassandra a better fit for this use-case ? I suspect it's an analytics use-case, isn't it ?
As far as I understand, you want to access a single site by its id, in which case lookup by partition key is ideal. The other use-case which requires to fetch all the sites is best suited with Spark

In co-located caching, how is determined the location of a key?

This a design question about the Caching component, I can see two approaches in determining where is the data:
Each role instance maintains a table containing the entire set of keys, tracking the corresponding instance holding the data.
The location of the data is determined by the hash code of the key.
In the first case, it would mean that it's important to keep a reasonable set of keys.
In the second case, that testing the existence of a key would generate a network round trip...
My guess is 2), it utilize hash to determine the location, maybe consistent hashing.
And I think yes, testing the existence of a key would generate a network I/O, but I don't think it needs to call all you co-location server since from the hash it should know which server contains your data and just need to connect to it.

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer, in terms of answering whether they could be guessed or not, they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first, read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set, the ranges of valid ObjectIDs should be fairly easy to work out since they are all prefixed with a timestamp (unless you are manipulating this in some way).
Mongo's ObjectId were never meant to be a protection from brute force attack (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user because this user should not know its id.
For an actual protection of your resources, employ other techniques.
If you defend against an unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny for everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.

Which is better - auto-generated id or manual id assignment in couchdb documents?

Should I be generating the id of the documents in a CouchDB or should I depend on CouchDB to generate it? What are the advantages or disadvantages in these approaches? Is there any performance implications on any of these options?
There is no difference as far as CouchDB is concerned. Frederick is right that sequential ids are slightly faster. If you query /_uuids?count=10 you will notice that the UUIDs are sequential (by default).
However, even with random IDs, once you run compaction, they will all be in the "right" order internally in the .couch file and at that point there is no difference. So in the long run, I don't usually worry about it.
The main thing is that you should use mostly sequential ids. As this article and this bit of the couchdb book explain, using random ids results in a much less efficient structure internally, both speed wise and in terms of space used on disc.
Self generated ids are almost impossible to deal with if you have two or more separated instances of your app. Because the synchronisation between the different instances is not instantaneous. A solution for this can be to have one server dedicated to generate (or check the availability of) the ids, for example using a SQL database, and acting as a gate for document creation.
On the other hand, if you have only one server and will never need more, there is one advantage I find interesting to self generated uids: since they have to be unique, you can use them in urls. For instance take the slug of the title of a blog post as the _id.
Performance-wise, the CouchDB's generated ids are pretty long so if your own ids are shorter, you will save significant disk space (assuming you have a looot of documents).
Both answers above tell about PROS of sequential IDs.
Here is a major problem arose by sequential IDs.
Predictability of other IDs in documents using a single ID.
Due to this we can't use sequential IDs in application URLs as identifiers due to other IDs being predictable using one ID, and using as url authentication is also not possible.( As done by file sharing services).

Resources