I read the paper about Kademlia available here.
I don't understand how the number k is chosen (for the k-bucket).
I don't understand this sentence: "k is chosen such that any given k nodes are very unlikely to fail within an hour of each other."
I also don't understand what it means for a node to fail.
Kademlia is an abstract algorithm.
Individual implementations can choose their own k based on expected characteristics of the nodes in the network.
For example, if you wanted to form a small overlay of a few hundred nodes in a highly reliable datacenter, then k = 2 may be sufficient.
BitTorrent uses k = 8 with lots of domestic (read: quite unreliable) nodes scattered over the whole internet, and it does its job, but its job is not particularly demanding, so one can't infer from that alone that 8 is the practical upper limit.
I don't understand what it means for a node to fail.
Computers crash, go offline, change internet connections, reboot, are put into hibernation. All of those are effectively node failures from the perspective of the network.
Here is what I've managed to find about the significance of the k-bucket size for Kademlia network behavior:
Based on the paper Improving Community Management Performance with Two-Level Hierarchical DHT Overlays:
A low k means a more fragmented network.
A high k means fewer hops during lookup, but more maintenance traffic.
The value of k (referring to bucket size, a Kademlia-specific parameter) has a major impact for the operation of the Kademlia DHT. On the first hand, the value should not be set too low; otherwise the network could become fragmented, complicating or even preventing the routing of messages between some peers. On the second hand, the value should not be set too high or a significant amount of unnecessary load from maintenance traffic would be inflicted on the network.
As for the exact value (tested on networks of size 100 and 500):
The measurements indicated that k values of 1 and 2 were insufficient to prevent the fragmentation of the network. The chosen k value of 3 was enough to achieve a consistent network structure in both of the network sizes. With the k value of 4 and larger, the nodes' knowledge of the network develops further, but at the cost of increased maintenance traffic. Even though the average hop count decreases, the larger routing table induces more KeepAlive messaging.
The "maintanance traffic" is amount of KeepAlive messages being sent. KeepAlive messages are being sent to all devices in k-buckets to ensure that connections are alive. If we didn't send them, we could one day end up with no connections, unable to participate in network. Although in this paper, they are sending several such messages per minute, I'm not sure that that much is in real life necessary.
The other use case of k-buckets is described in the paper linked in this question: Kademlia: A Peer-to-peer Information System Based on the XOR Metric. I omitted it at first, as I thought it was the original Kademlia paper. Although the authors are the same, it turns out its content is different, and they actually discuss the significance of k-buckets:
They reward nodes for staying in the network longer (the more stable a node is, the more significance it has in the network).
They resist DoS attacks: flooding the network with new nodes won't be destructive, since old nodes keep their place in the k-buckets.
A second benefit of k-buckets is that they provide resistance to certain DoS attacks. One cannot flush nodes’ routing state by flooding the system with new nodes. Kademlia nodes will only insert the new nodes in the k-buckets when old nodes leave the system
So I guess this is the second factor when choosing the k value: it also influences how resistant the network is to Sybil/Eclipse attacks.
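To make the quoted insertion rule concrete, here is a minimal sketch of a single k-bucket, assuming a list ordered from least- to most-recently seen and a hypothetical ping() callback. It is not code from the paper, just an illustration of why flooding the network with new node IDs cannot evict live old nodes.

```python
K = 8  # bucket size; the parameter this whole question is about

class KBucket:
    def __init__(self, k=K):
        self.k = k
        self.nodes = []          # index 0 = least recently seen

    def update(self, node, ping):
        if node in self.nodes:               # already known: move to the tail
            self.nodes.remove(node)
            self.nodes.append(node)
        elif len(self.nodes) < self.k:       # room left: just append
            self.nodes.append(node)
        else:
            oldest = self.nodes[0]
            if ping(oldest):                 # old node still alive: keep it,
                self.nodes.remove(oldest)    # refresh its position,
                self.nodes.append(oldest)    # and drop the newcomer
            else:
                self.nodes.pop(0)            # old node is dead: replace it
                self.nodes.append(node)
```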
Let's say I'm building something like AWS Lambda / Cloudflare Workers, where I allow users to submit arbitrary binaries, and then I run them wrapped in sandboxes (e.g. Docker containers / gVisor / etc), packed multitenant-ly onto a fleet of machines.
Ignore the problem of ensuring the sandboxing is effective for now; assume that problem is solved.
Each individual execution of one of these worker-processes is potentially a very heavy workload (think SQL OLAP reports.) A worker-process may spend tons of CPU, memory, IOPS, etc. We want to allow them to do this. We don't want to limit users to a small fixed slice of a machine, as traditional cgroups limits enable. Part of our service's value-proposition is low latency (rather than high throughput) in answering heavy queries, and that means allowing each query to essentially monopolize our infrastructure as much as it needs, with as much parallelization as it can manage, to get done as quickly as possible.
We want to charge users in credits for the resources they use, according to some formula that combines the CPU-seconds, memory-GB-seconds, IO operations, etc. This will disincentivize users from submitting "sloppy" worker-processes (because a process that costs us more to run, costs them more to submit.) It will also prevent users from DoSing us with ultra-heavy workloads, without first buying enough credits to pay the ensuing autoscaling bills in advance :)
We would also like to enable users to set, for each worker-process launch, a limit on the total credit spend during execution — where if it spends too many CPU-seconds, or allocates too much memory for too long, or does too many IO operations, or any combination of these that adds up to "spending too many credits", then the worker-process gets hard-killed by the host machine. (And we then bill their account for exactly as many credits as the resource-limit they specified at launch, despite not successfully completing the job.) This would protect users (and us) from the monetary consequences of launching faulty/leaky workers; and would also enable us to predict an upper limit on how heavy a workload could be before running it, and autoscale accordingly.
This second requirement implies that we can't do the credit-spend accounting after the fact, async, using observed per-cgroup metrics fed into some time-series server; but instead, we need each worker hypervisor to do the credit-spend accounting as the worker runs, in order to stop it as close to the time it overruns its budget as possible.
Basically, this is, to a tee, a description of the "gas" accounting system in the Ethereum Virtual Machine: the EVM does credit-spend accounting based on a formula that combines resource-costs for each op, and hard-kills any "worker process" (smart contract) that goes over its allocated credit (gas) limit for this launch (tx and/or CALL op) of the worker.
However, the "credit-spend accounting" in the EVM is enabled by instrumenting the VM that executes code such that each VM ISA op also updates a gas-left-to-spend VM register, and aborts VM execution if the gas-left-to-spend ever goes negative. Running native code on bare-metal/regular IaaS VMs, we don't have the ability to instrument our CPU like that. (And doing so through static binary translation would probably introduce far too much overhead.) So doing this the way the EVM does it, is not really an option.
I know Linux does CPU accounting, memory accounting, etc. Is there a way, using some combination of cgroups + gVisor-alike syscall proxying, to approximate the function of the EVM's "tx gas limit", i.e. to enable processes to be hard-killed (instantly/within a few ms of) when they go over their credit limit?
I'm assuming there's no off-the-shelf solution for this (haven't been able to find one after much research.) But are the right CPU counters + kernel data structures + syscalls in place to be able to develop such a solution, and to have it be efficient/low-overhead?
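For concreteness, here is a rough sketch of the kind of watchdog I have in mind, assuming cgroup v2 on a reasonably recent kernel. The cgroup path, pricing weights, polling interval, and credit limit are all made-up placeholders, not a claim about how the accounting should actually be priced.

```python
"""Hypothetical sketch: poll a worker's cgroup v2 stats and hard-kill it
when a combined "credit" budget is exhausted."""

import time
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/workers/job-1234")   # hypothetical cgroup
POLL_INTERVAL = 0.01                               # 10 ms accounting tick

# Hypothetical pricing: credits per CPU-second, per GB-second of RAM, per IO op.
CPU_SEC_PRICE, GB_SEC_PRICE, IO_OP_PRICE = 1.0, 0.1, 0.0001
CREDIT_LIMIT = 500.0                               # set at launch by the user

def cpu_seconds() -> float:
    for line in (CGROUP / "cpu.stat").read_text().splitlines():
        key, value = line.split()
        if key == "usage_usec":
            return int(value) / 1e6
    return 0.0

def memory_gb() -> float:
    return int((CGROUP / "memory.current").read_text()) / 1e9

def io_ops() -> int:
    total = 0
    for line in (CGROUP / "io.stat").read_text().splitlines():
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key in ("rios", "wios"):
                total += int(value)
    return total

spent = 0.0
last_cpu, last_io = cpu_seconds(), io_ops()
while spent < CREDIT_LIMIT:
    time.sleep(POLL_INTERVAL)
    cpu, io = cpu_seconds(), io_ops()
    # CPU and IO are cumulative counters; memory is charged per tick it is held.
    spent += (cpu - last_cpu) * CPU_SEC_PRICE
    spent += memory_gb() * POLL_INTERVAL * GB_SEC_PRICE
    spent += (io - last_io) * IO_OP_PRICE
    last_cpu, last_io = cpu, io

# cgroup v2's cgroup.kill (kernel >= 5.14) SIGKILLs every process in the group.
(CGROUP / "cgroup.kill").write_text("1")
```

The polling loop obviously only kills the worker within one tick of overrunning, not instantly; that's the trade-off I'm asking about versus EVM-style per-op instrumentation.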
Is there an upper limit to the suggested size of the value stored for a particular key in Redis?
Is 100KB too large?
There are two things that you need to take into consideration when deciding if something is "too big".
Does Redis have support for the size of key/value object that you want to store?
The answer to this question is documented pretty well on the Redis site (https://redis.io/topics/data-types), so I won't go into detail here.
For a given key/value size, what are the consequences I need to be aware of?
This is a much more nuanced answer as it depends heavily on how you are using Redis and what behaviors are acceptable to your application and which ones are not.
For instance, larger key/value sizes can lead to fragmentation of the memory space within your server. If you aren't using all the memory in your Redis server anyway, then this may not be a big deal to you. However, if you need to squeeze every bit of memory out of your Redis server that you can, then you are reducing the efficiency of how memory is allocated and losing access to some memory that you would otherwise have.
As another example, when you are reading these large key/value entries from Redis, it means you have to transfer more data over the network from the server to the client. Some consequences of this are:
It takes more time to transfer the data, so your client may need to have a higher timeout value configured to allow for this additional transfer time.
Requests made to the server on the same TCP connection can get stuck behind the big transfer and cause other requests to timeout. See here for an example scenario.
Your network buffers used to transfer this data can impact available memory on the client or server, which can aggravate the available memory issues already described around fragmentation.
If these large key/value items are accessed frequently, this magnifies the impacts described above as you are repeatedly transferring this data over and over again.
So, the answer is not a crisp "yes" or "no", but some things that you should consider and possibly test for your expected workload. In general, I do advise our customers to try to stay as small as possible and I have often said to try to stay below 100kb, but I have also seen plenty of customers use Redis with larger values (in the MB range). Sometimes those larger values are no big deal. In other cases, it may not be an issue until months or years later when their application changes in load or behavior.
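If you want to see where an existing dataset stands relative to a size threshold, here is a small sketch, assuming the redis-py client and a local server; the threshold and connection details are placeholders.

```python
import redis

THRESHOLD_BYTES = 100 * 1024        # the ~100 KB rule of thumb mentioned above
r = redis.Redis(host="localhost", port=6379)

# SCAN iterates incrementally without blocking the server the way KEYS would.
for key in r.scan_iter(count=1000):
    # MEMORY USAGE reports the approximate bytes a key and its value occupy.
    size = r.memory_usage(key)
    if size and size > THRESHOLD_BYTES:
        print(f"{key!r}: ~{size / 1024:.1f} KB")
```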
Is there an upper limit to the suggested size of the value stored for a particular key in Redis?
According to the official docs, the maximum size of a key (and of a String value) in Redis is 512 MB.
Is 100KB too large?
It depends on the application and usage; for general-purpose applications it should be fine.
I am new to High Performance Computing (HPC), but I am going to work on an HPC project, so I need some help with a few fundamental questions.
The application scenario is simple: several servers connected by an InfiniBand (IB) network, one acting as the master and the others as slaves. Only the master reads/writes in-memory data (ranging in size from 1 KB to several hundred MB) to the slaves, while the slaves just passively store the data in their memory (and dump it to disk at the right time). All computation is performed on the master, before writing the data to the slaves or after reading it back. The system requires low latency for small regions of data (1 KB-16 KB) and high throughput for large regions (several hundred MB).
So, my questions are:
1. Which concrete approach is more suitable for us: MPI, a primitive IB/RDMA library, or ULPs over RDMA?
As far as I know, an existing Message Passing Interface (MPI) library, primitive IB/RDMA libraries such as libverbs and librdmacm, and User Level Protocols (ULPs) over RDMA might all be feasible choices, but I am not very sure of their applicable scopes.
2. Should I do some tuning of the OS or the IB network for better performance?
There is a paper [1] from Microsoft that reports:
We improved performance by up to a factor of eight with careful tuning and changes to the operating system and the NIC driver.
For my part, I will try to avoid such performance tuning as much as I can. However, if tuning is unavoidable, I will do my best. The IB network in our environment is Mellanox InfiniBand QDR 40 Gb/s, and I can freely choose the Linux OS for the servers.
Any ideas, comments, and answers are welcome!
Thanks in advance!
[1] FaRM: Fast Remote Memory
If you use MPI, you will have the benefit of an interconnect-independent solution. It doesn't sound like this is going to be something you are going to keep around for 20 years, but software lasts longer than you ever think it will.
Using MPI also gives you the benefit of being able to debug on your (possibly oversubscribed) laptop or workstation before rolling it out onto the InfiniBand machines.
As to your second question about tuning the network, I am sure there is no end of tuning you can do, but until you have some real workloads and hard numbers, you're wasting your time. Get things working first, then worry about optimizing the network. Maybe you need to tune for many tiny messages. Perhaps you need to worry about a few large transfers. The tuning will be quite different depending on the case.
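For what it's worth, here is a minimal sketch, assuming mpi4py and NumPy, of the master/slave pattern described in the question: rank 0 pushes a buffer to a storage rank and reads it back. The buffer size, ranks, and tags are arbitrary; the point is that the same code runs unchanged over InfiniBand or plain Ethernet.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

BUF_SIZE = 16 * 1024                      # 16 KB, the small-latency case

if rank == 0:                             # master
    data = np.ones(BUF_SIZE, dtype=np.uint8)
    comm.Send([data, MPI.BYTE], dest=1, tag=0)        # write to the slave
    readback = np.empty(BUF_SIZE, dtype=np.uint8)
    comm.Recv([readback, MPI.BYTE], source=1, tag=1)  # read it back
    print("round trip ok:", np.array_equal(data, readback))
elif rank == 1:                           # slave: passively store and echo
    store = np.empty(BUF_SIZE, dtype=np.uint8)
    comm.Recv([store, MPI.BYTE], source=0, tag=0)
    comm.Send([store, MPI.BYTE], dest=0, tag=1)
```

Run it with something like mpirun -np 2 python master_slave.py on a workstation first, then on the IB cluster.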
I have a custom protocol overlaying TCP that can be described as follows:
Client sends a packet A to the server. The server ACKs packet A.
Client sends a packet B.
In other words, at any point in time there is only one unacknowledged packet. Hence, the factors that come into play for sending messages as fast as possible are:
How soon a packet can arrive at the destination. This implies the least amount of fragmentation done by TCP: if a packet can arrive in a single segment as opposed to five segments, the server can respond to it that much sooner.
The unit of work done by the server for that packet. At present I am not focused on this point, though eventually I will touch on it as well.
Also assume, the rate of loss is negligible.
Nagle is disabled.
Typical packet sizes vary from 1KB to 3KB.
Bandwidth is 1Gb/sec
I am thinking that if I configure the MTU to equal the biggest message size (3 KB + headers), this should improve the number of messages I can send in a second. My question is: are there any negative consequences to changing the MTU? This application runs inside a LAN in a managed environment.
Alternatively, if I set the don't-fragment flag, would that be equivalent to the above change?
First, let's clarify the difference between MTU and MSS. They belong to different layers of the stack (the link layer and the transport layer, respectively).
TCP/IP is a rather unfortunate layer cake: both layers support fragmentation, but differently, and they do not cooperate on this matter.
IP fragmentation is something TCP is unaware of. If one of the IP fragments is lost, the whole series is declared lost. Not so for TCP: if one of the IP datagrams that make up the same TCP stream is lost, and the stream was segmented by TCP, only a retransmit of the lost parts is required.
The core reason for this mess is that a router must be able to impedance-match between two physical networks with different MTUs without understanding the higher (TCP) protocol.
Now, all modern networks support "jumbo frames" (you have to configure your NIC to be able to send jumbo frames; all modern NICs will always be able to receive frames up to 90xx bytes).
As usual with increasing the MTU, it
is not useful unless you also increase the MSS,
improves performance (bandwidth), and
hurts performance (zero-load latency to the first byte).
In some applications, like the Gigalinx implementation of GigE Vision, increasing the MTU is a requirement: over fast networks, the overhead of a 1500-byte MTU is intolerable.
As an architect, the thing to ask yourself is what your application is actually doing. If there is a "relevant packet size", in the sense that "until the first 3 kB of data are received there's nothing to do with the rest", and you really need this small performance edge, increase the MTU. Before doing that, consider dropping TCP altogether in favor of a more Ethernet-friendly protocol, and of course do not implement it yourself but choose something like ZeroMQ, which works well.
Second question: don't-fragment is an IP setting. It is typically relevant at routers, which are expected to match networks of different MTUs; it means "discard the packet unless it can be relayed to the other network as-is". If this happens somewhere along the path, TCP cannot work over that path: it will try to retransmit, fail again and again, and eventually disconnect, and further behavior depends on what the application is doing. This is a typical situation on the internet, with misconfigured public wifi networks and home networks: you can sometimes browse Facebook but not practically watch anything on YouTube, and this is why. Network administrators would never know the reason.
MSS = Maximum segment size = the amount of data sent in one TCP packet.
Decreasing the MSS will reduce performance, as the data will be split across more TCP packets.
Increasing the MSS past its correct value will cause fragmentation on the link layer (ethernet).
TCP already tries to find (per connection) the largest possible MSS that doesn't cause fragmentation. Unless this fails (it doesn't), there's no need to override this value. Link-layer fragmentation should be avoided: overriding the MSS can save very little and can easily hurt performance as well.
Don't touch MSS unless you know what you're doing. It has its value for a good reason.
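If you want to check what the kernel actually negotiated rather than guessing from the interface MTU, here is a small sketch using only the standard library on Linux; the peer address and port are placeholders.

```python
import socket

with socket.create_connection(("192.0.2.10", 9000)) as s:   # hypothetical LAN peer
    # TCP_MAXSEG reports the maximum segment size the kernel will use on this
    # connection (roughly the path MTU minus IP and TCP headers).
    mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
    print(f"negotiated MSS: {mss} bytes")
    # With a 9000-byte jumbo MTU on both hosts and the switch in between,
    # a 3 KB application message would fit in a single segment.
```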
I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates; I just need some average value. I'm basically writing something up at the moment, and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is 32 nanoseconds. Network latency is measured in milliseconds, and one millisecond is 1,000,000 nanoseconds. "Good" network latency is considered to be around or under 100 ms, so given that, inter-core latency is on the order of a million times lower.
Besides latency, there's also bandwidth to consider. Again, based on the linked benchmark for that particular configuration, inter-core bandwidth is about 14 GB/sec, whereas this real-world test of a Gigabit Ethernet connection, http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, shows about 35.8 MB/sec. So the difference there is smaller, only on the order of 400 times in bandwidth as opposed to roughly 1,000,000 times in latency. Depending on which matters more to your application, that might change your numbers.
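Spelled out, the two ratios look like this (the inputs are just the figures quoted from the benchmarks above; treat them as rough orders of magnitude):

```python
intercore_latency_ns = 32                   # Q6600 inter-core latency
network_latency_ns = 100e6                  # 100 ms expressed in nanoseconds
print(network_latency_ns / intercore_latency_ns)   # ~3 million times slower

intercore_bw_mb_s = 14_000                  # 14 GB/s inter-core bandwidth
gigabit_bw_mb_s = 35.8                      # measured Gigabit Ethernet figure
print(intercore_bw_mb_s / gigabit_bw_mb_s)  # ~400 times slower
```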
Network latencies are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin or Myrinet (~$1k/port). Inter-core latencies are measured in nanoseconds, as the data is copied from one memory area to another and then some signal is sent from one CPU to another (the data being protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
The network layer imposes a massive overhead in terms of packets, addressing, handling contention, etc.
The physical distances involved have a sizeable impact.
Measuring inter-core communication speed would be very difficult, but given the above I think it's a redundant calculation to make.
This is a non-trivial thing to find. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of: the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, and so on. In short, run a benchmark or you'll be guessing in the dark.
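As a rough starting point for such a benchmark, here is a sketch of a ping-pong between two processes pinned to different cores over a multiprocessing pipe (Linux-only because of the affinity call). Python and pipe overhead dominate, so treat the result as an upper bound on inter-core latency, not a hardware figure.

```python
import os
import time
from multiprocessing import Process, Pipe

ROUNDS = 10_000

def echo(conn, core):
    os.sched_setaffinity(0, {core})      # pin the child to one core (Linux)
    for _ in range(ROUNDS):
        conn.send_bytes(conn.recv_bytes())

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=echo, args=(child, 1))
    p.start()
    os.sched_setaffinity(0, {0})         # pin the parent to another core

    start = time.perf_counter()
    for _ in range(ROUNDS):
        parent.send_bytes(b"x")
        parent.recv_bytes()
    elapsed = time.perf_counter() - start
    p.join()

    # Each round is one full round trip, so halve it for the one-way figure.
    print(f"~{elapsed / ROUNDS / 2 * 1e6:.1f} us one-way (upper bound)")
```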