FIO Flexible IO tester for repetitive data access patterns - linux

I am currently working on a project and I need to test my prototype with repetitive data access patterns. I came across fio which is a flexible I/O tester for Linux (1).
Fio has many options, and I want it to produce a workload that accesses the same blocks of a file, the same number of times, over and over again. I also need those accesses to not be equal among the blocks. For instance, if fio creates a file named "test.txt"
and this file is divided into 10 blocks, I need the workload to read a specific subset of these blocks, with a different number of I/Os each, over and over again. Let's say it chooses to access blocks 3, 7 and 9. Then I want it to access these in a specific order, a specific number of times each, over and over again. If this workload can be described by N passes, then I want it to be something like this:
1st pass: read block 3 10 times, read block 7 5 times, read block 9 2 times.
2nd pass: read block 3 10 times, read block 7 5 times, read block 9 2 times.
...
Nth pass: read block 3 10 times, read block 7 5 times, read block 9 2 times.
Question 1: Can the above workload be produced with Fio? If yes, How?
Question 2: Is there a mailing list, forum, website, community for Fio users?
Thank you,
Nick

http://www.spinics.net/lists/fio/index.html is the archive of the fio mailing list, which you can follow.
The HOWTO at http://www.bluestop.org/fio/HOWTO.txt will also help you.

This is actually quite a tricky thing to do. The closest you'll get with parameters is using one of the non-uniform distributions (see random_distribution in the HOWTO) but you'll be saying re-read blocks A, B, C more than blocks X, Y, Z and you won't be able to control the exact counts.
An alternative is to write an iolog containing the exact sequence you're looking for and replay it (see "Trace file format v2" in the HOWTO).
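For example, the N-pass pattern from the question can be generated as a version 2 iolog and replayed with the read_iolog option. A minimal sketch, assuming a 4096-byte block size and the file name "test.txt" (both assumptions; check the "Trace file format v2" section of the HOWTO for the exact grammar):

```python
BLOCK_SIZE = 4096                    # assumed block size
FILENAME = "test.txt"
PATTERN = [(3, 10), (7, 5), (9, 2)]  # (block index, reads per pass)
N_PASSES = 4                         # pick however many passes you need

def make_iolog() -> str:
    """Build a fio version-2 iolog repeating the pass pattern N_PASSES times."""
    lines = ["fio version 2 iolog",
             f"{FILENAME} add",
             f"{FILENAME} open"]
    for _ in range(N_PASSES):
        for block, count in PATTERN:
            for _ in range(count):
                # one read entry per I/O: "<filename> read <offset> <length>"
                lines.append(f"{FILENAME} read {block * BLOCK_SIZE} {BLOCK_SIZE}")
    lines.append(f"{FILENAME} close")
    return "\n".join(lines) + "\n"

# Write the log out; it can then be replayed with something like:
#   fio --name=replay --read_iolog=pattern.log
with open("pattern.log", "w") as f:
    f.write(make_iolog())
```

Each pass emits exactly the reads you listed (block 3 ten times, block 7 five times, block 9 twice), so the replay reproduces the sequence deterministically instead of statistically.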

Related

Gray code fifo getting number of elements in buffer

I have 2 different clocks, one for reading and one for writing. I am using Gray code to synchronize the pointers, with an additional 2 flip-flops for synchronization on the different clock of the input signal.
The articles that I have read explain how to determine the full and empty signals using Gray code, by comparing the 2 MSBs for the full state and checking equality for the empty state.
However, I need to get the number of elements in the buffer and not just the full or empty signals. Is this possible to do with gray code?
In a comment you ask about the common clock and mentioned that your depth is not a power of two.
First : Edit your original post and add that question and the information.
Second: In an asynchronous FIFO there is no common clock. The write operations all run from the write clock; the read operations all run from the read clock. The critical part is exchanging information between the clock domains, and that is where the Gray code comes in.
Third: An asynchronous FIFO uses Gray code because only one bit changes at a time. What is important is that the sequence is circular: the difference between your last and your first value is also only one bit:
Counter Gray-code
000 000
001 001
010 011
011 010
100 110
101 111
110 101
111 100 <-- Last
000 000 <-- First again
This works if and only if the depth (and thus the counters) is a power of two. Therefore an asynchronous FIFO always has a depth which is a power of two.
If you must have a different depth, you can add a synchronous FIFO at the beginning or the end. However, if you think about it, a FIFO is just an elastic buffer: whether it is, e.g., 16 entries deep or 12, its behavior is no different, other than that you have the potential to store more values.
Last: As supercat said: You convert from binary to Gray code, cross to the other clock domain, then convert Gray code to binary again.
In the end clock domain you can safely compare read and write counters to determine the fill-level of the FIFO.
If the level is needed on both read and write side you have to implement this process twice, once in each clock domain.
The most understandable way to compute the difference between two Gray-code values is to synchronize them with a common clock, convert them to binary, and then do an ordinary binary subtraction. It may be possible to design a fully combinatorial circuit that computes the difference between two Gray-code values such that, if all bits of one value are stable and one bit of the other value changes, only one bit of the output changes while all others remain stable. However, such a design would be much more complicated than one which simply synchronizes both counters, converts to binary, and subtracts.
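The convert-then-subtract approach can be sketched in a few lines. This is a model of the arithmetic only, not HDL; the counter-width convention (pointers one bit wider than the address, so full and empty can be distinguished) is a common assumption, not something from the post:

```python
def bin_to_gray(b: int) -> int:
    """Binary count to Gray code: each bit XORed with the next higher bit."""
    return b ^ (b >> 1)

def gray_to_bin(g: int) -> int:
    """Gray code back to binary: XOR prefix scan from the MSB down."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def fifo_fill_level(wr_gray: int, rd_gray: int, depth: int) -> int:
    """Fill level from synchronized write/read Gray pointers.

    Assumes depth is a power of two and the pointers count modulo
    2 * depth (one extra wrap bit), so the subtraction distinguishes
    full (difference == depth) from empty (difference == 0).
    """
    return (gray_to_bin(wr_gray) - gray_to_bin(rd_gray)) % (2 * depth)
```

For a depth-8 FIFO with 4-bit pointers, a write pointer at 5 and a read pointer at 2 gives a fill level of 3, and the modulo handles the circular wraparound case correctly.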

Buffer overflow exploitation 101

I've heard in a lot of places that buffer overflows, illegal indexing in C like languages may compromise the security of a system. But in my experience all it does is crash the program I'm running. Can anyone explain how buffer overflows could cause security problems? An example would be nice.
I'm looking for a conceptual explanation of how something like this could work. I don't have any experience with ethical hacking.
First, a buffer overflow (BOF) is only one method of gaining code execution. When one occurs, the impact is that the attacker basically gains control of the process. This means the attacker can make the process execute arbitrary code with the current process's privileges (whether the process runs as a high- or low-privileged user on the system respectively increases or reduces the impact of exploiting a BOF on that application). This is why it is always strongly recommended to run applications with the least privileges needed.
Basically, to understand how a BOF works, you have to understand how the code you have written gets compiled into machine code, and how the data managed by your software is stored in memory.
I will try to give you a basic example of a subcategory of BOF called stack-based buffer overflows:
Imagine you have an application asking the user to provide a username.
This data will be read from user input and then stored in a variable called USERNAME, allocated as a 20-byte array of chars.
For this scenario to work, we will assume the program does not check the length of the user input.
At some point during processing, the user input is copied into the USERNAME variable (20 bytes), but since the user input is longer (let's say 500 bytes), the data around this variable is overwritten in memory:
Imagine such memory layout :
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
If you define the 3 local variables USERNAME, variable2 and variable3, they may be stored in memory the way shown above.
Notice the RETURN ADDRESS: this 4-byte memory region stores the address to return to in the calling function (thanks to this, when you call a function in your program and reach the end of that function, the program flow naturally goes back to the instruction just after the initial call to that function).
If an attacker provides a username of 24 'A' characters, the memory layout becomes something like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][variable3][RETURN ADDRESS]
Now, if an attacker sends 50 'A' characters as the USERNAME, the memory layout looks like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][ AAAA ][ AAAA ][OTHER AAA...]
In this situation, at the end of the function's execution, the program crashes: it tries to jump to the invalid address 0x41414141 (char 'A' = 0x41), because the overwritten RETURN ADDRESS does not point to valid code.
If you replace the multiple 'A's with carefully chosen bytes, you may be able to:
overwrite RETURN ADDRESS with an interesting location.
place "executable code" in the first 20 + 4 + 4 bytes
You could for instance set RETURN ADDRESS to the address of the first byte of the USERNAME variable (this method is mostly not usable anymore, thanks to the many protections that have been added both to OSes and to compiled programs).
I know it is quite complex to understand at first, and this explanation is a very basic one. If you want more detail please just ask.
I suggest you have a look at good tutorials like this one; they are quite advanced but more realistic.

Cluster nodes need to read different sections of an input file - how do I organize it?

I am trying to read an input file in a cluster environment. Different nodes will read different parts of it. However the parts are not clearly separated, but interleaved in a "grid".
For example, a file with 16 elements (assume integers):
0 1 2 3
4 5 6 7
8 9 A B
C D E F
If I use four nodes, the first node will read the top left 2x2 square (0,1,4,5), the second node will read the top right 2x2 square and so on.
How should I handle this? I can use MPI or OpenMP. I have two ideas but I don't know which would work better:
Each node will open the file and have its own handle to it. Each node would read the file independently, using only the part of the file it needs and skipping over the rest. In this case, what would be the difference between using fopen and MPI_File_open? Which one would be better?
Use one node to read the whole file and send each part of the input to the node that needs it.
Regarding your question,
I would not suggest the second option you mentioned, that is, using one node to read and then distributing the parts. The reason is that this is slow, especially if the file is large. You pay the overhead twice: first the other processes wait while one node reads, and then the data that was read has to be sent out. So it is clearly a no-go for me.
Regarding your first option, there is no big difference between using fopen and MPI_File_open. But here I would still suggest MPI_File_open, to get facilities like non-blocking I/O operations and shared file pointers (they make life easier).
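Whichever API you use, each node has to work out which byte ranges of the file belong to its tile. For the 4x4 grid in the question split into 2x2 blocks, the per-rank offsets can be computed as below (a sketch assuming 4-byte integers, row-major file layout, and ranks numbered left-to-right then top-to-bottom; with MPI you would feed these offsets to MPI_File_read_at, or describe the tile once with MPI_Type_create_subarray and set a file view):

```python
INT_SIZE = 4   # bytes per element (assumed)
GRID = 4       # the file holds a GRID x GRID matrix, row-major
BLOCK = 2      # each rank owns a BLOCK x BLOCK tile

def tile_offsets(rank: int) -> list[int]:
    """Byte offset of every element belonging to `rank`'s tile."""
    tiles_per_row = GRID // BLOCK
    tile_row, tile_col = divmod(rank, tiles_per_row)
    offsets = []
    for r in range(BLOCK):
        for c in range(BLOCK):
            row = tile_row * BLOCK + r
            col = tile_col * BLOCK + c
            # position in the flattened row-major file, scaled to bytes
            offsets.append((row * GRID + col) * INT_SIZE)
    return offsets
```

Rank 0 gets elements 0, 1, 4, 5 exactly as in the question; note that each row of a tile is contiguous but the rows are not, which is precisely the strided access pattern MPI_Type_create_subarray was designed for.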

file copy from one to 100 servers [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I want to transfer a 100 GB file that resides on a single server to 100 other servers in the network, over a 1 Gbps line. What is the best way to do it? My solution is to copy the file to k servers (say 9) and then assign the remaining (100-9) servers among those 9 servers.
This is a far better solution than copying the file from 1 server to 100 sequentially. My question is: how do I determine k? Or, what is the calculation to determine the most efficient value of k? Please suggest a better solution if there is one. Sorry, I forgot to mention: we CANNOT USE TORRENT; not all companies allow torrents. This is an interview question. Appreciate your response. Thanks
Assuming that you can copy to only one server at a time, it can go as follows.
Main server copies to Server S1.
S1 copies to S2 (1 copy)
S1 copies to S3 and S2 copies to S4 (2 copies in parallel)
S1 copies to S5, S2 copies to S6, S3 copies to S7, S4 copies to S8 (4 copies in parallel)
And so on..
So, the pattern of the number of parallel copies is: 2^0, 2^1, 2^2, etc.
1 + 2 + 4 + 8 + 16 + 32 + 64 > 100
So, the number of doubling steps needed can be found with this formula:
(2^k >= 100) and (2^(k-1) < 100)
In this case, k evaluates to 7 (after the first copy).
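The doubling schedule above is easy to check programmatically: every step doubles the number of servers that hold the file, so roughly log2 of the server count in steps suffices. A sketch under the same simplifying assumption as the answer (each holder runs exactly one outgoing copy per step):

```python
def steps_to_copy(n_targets: int) -> int:
    """Steps until n_targets servers have the file, when every server
    that already holds it copies to one new server per step
    (so the number of holders doubles each step)."""
    holders, steps = 1, 0             # the main server already has the file
    while holders < n_targets + 1:    # +1: the main server is not a target
        holders *= 2
        steps += 1
    return steps
```

For 100 targets this gives 7 steps (1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128 holders), matching the hand calculation above.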
Let there be n servers to which the files to be copied. Your approach is correct if copying can be done in parallel, i.e. after the initial round of copying there will be k servers with a copy of the file. If copying from these k servers to the remaining n-k servers can be done in parallel then your approach is ideal.
You can find the value of k as follows,
Select k such that k² ≤ n and (k+1)² > n.
One option is to multicast the file on the network. This way the first server sends the file only once (and the other servers all receive it simultaneously). It can get really tricky, but I imagine this would be the fastest way. You would probably need to devise your own custom protocol for what to do when one computer loses a packet.
https://github.com/windsurfer/C-Sharp-Multicast
I know for the interview it may be too late, but for the record perhaps you could consider something like this:
https://code.google.com/p/castcopy/
or some other multicast copy tool. No need to repeat the packets for each or some of the receiving clients. You just send one copy of the packet and all others listen at the same time!
Pan
If you use bittorrent to distribute the file over your lan then the torrent software will take care of load-balancing for you i.e. you don't need to precompute "k." I recommend using utorrent for your clients, but any client will do. Here is a tutorial for setting up the tracker etc
An advantage of using bittorrent is that the recipient servers can start distributing chunks of the file before they have the entire file.
Under simplistic assumptions you could treat this as a dynamic programming problem: for i = 1..k, find the fastest way to produce i copies. At each step, consider the time taken to produce i - t copies in previous steps, and then add one step to run t copy operations in parallel, where t had better be no larger than i - t.
For the case where the count is a power of two, you can produce 2 copies (counting the original) in 1 step, 4 copies in 2 steps... 128 copies in 7 steps, which is quicker than it would take to do the 9 copies of your first stage, assuming that running 9 copies out of a single machine takes 9 times as long as copying to a single destination.
But all of this assumes that the time taken by a copy depends only on the outgoing bandwidth of the source - in practice I would expect that either all your network links are close together and the same, so that multiple copies at the same time risk slowing each other down, or your network links are widely separated but different, so that copies over different links take different amounts of time.
You should also consider sneakernet: copy to a removable USB stick or removable hard drive and carry the device to its destination for another local copy. Historically, attempts to replace relatives of sneakernet with network links, without working out the effective bandwidth of the existing sneakernet, have failed by not providing enough network bandwidth.
I can think of Divide and Conquer
100 (50,50) -> (25 , 25) -> (12 , 13) -> (6 , 6) -> (3 ,3) -> (1 , 2) ..STOP
I am assuming the copy function uses the source server's resources (e.g., for a copy from Server 1 to Server 2, Server 1's resources are used).
So from Server 1 to Server 2 and 3 (total 3 servers)
Now Server 1 to 4 , 2 to 5 , 3 to 6 (total 6 Servers)
Now Server 1 to 7 , 2 to 8 , 3 to 9....6 to 12 (total 12 Servers)
In the last step, a thread manager will copy Server 1 to Server 51, Server 2 to Server 52, ... Server 50 to Server 100.
bzip2 the file to compress it as much as possible
rsync it to all the other machines
Go for lunch/ Work on the next thing in your stack.
No time limit was mentioned so why assume one. It just makes things harder for yourself.
Two steps:
S00 (server one, the guy having the file initially) splits the file into 100 chunks, not saving the chunks to disk but instead sending chunks C01-C99 to S01-S99 respectively.
S00-S99 then send their chunk to all their siblings, though of course nobody sends to S00.
Expect the network to be saturated pretty badly!

Variable substitution faster than in-line integer in Vic-20 basic?

The following two (functionally equivalent) programs are taken from an old issue of Compute's Gazette. The primary difference is that program 1 puts the target base memory locations (7680 and 38400) in-line, whereas program 2 assigns them to a variable first.
Program 1 runs about 50% slower than program 2. Why? I would think that the extra variable retrieval would add time, not subtract it!
10 PRINT"[CLR]":A=0:TI$="000000"
20 POKE 7680+A,81:POKE 38400+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 1
10 PRINT "[CLR]":A=0:B=7680:C=38400:TI$="000000"
20 POKE B+A,81:POKE C+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 2
The reason is that BASIC is fully interpreted here, so the strings "7680" and "38400" need to be converted to binary numbers EVERY TIME line 20 is reached (506 times in this program). In program 2, they're converted once and stored in B and C. So as long as the search-for-and-fetch of a variable is faster than the string-to-binary conversion, program 2 will be faster.
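The same effect shows up in any interpreter that re-parses literals: converting text to a number on every iteration costs more than looking up an already-converted value. A rough analogue in Python, not a VIC-20 measurement (absolute timings will vary by machine; only the relative ordering matters):

```python
import timeit

# Re-convert the string "38400" on every iteration,
# analogous to BASIC re-parsing the in-line constant in line 20.
parse_each_time = timeit.timeit('int("38400") + a', setup='a = 5', number=100_000)

# Convert once and reuse the bound name,
# analogous to storing the constant in variable C first.
reuse_variable = timeit.timeit('c + a', setup='a = 5; c = 38400', number=100_000)

print(f"parse each time: {parse_each_time:.4f}s")
print(f"reuse variable:  {reuse_variable:.4f}s")
```

On CPython the variable-reuse version is consistently faster, for the same reason program 2 beats program 1: the conversion work is hoisted out of the loop.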
If you were to use a BASIC compiler (not sure if one exists for VIC-20, but it would be a cool retro-programming project), then the programs would likely be the same speed, or perhaps 1 might be slightly faster, depending on what optimizations the compiler did.
It's from page 76 of this issue: http://www.scribd.com/doc/33728028/Compute-Gazette-Issue-01-1983-Jul
I used to love this magazine. It actually says a 30% improvement. Look at what's happening in program 2 and it becomes clear: because you are looping a lot, using variables means the program does all the conversion work up front for calculating memory addresses. With the slower approach, each iteration has to re-evaluate the constants highlighted below as part of calculating the memory address:
POKE 7680+A,81:POKE 38400+A,6
This is just the nature of the BASIC Interpreter on the VIC.
Accessing the first defined variable will be fast; the second will be a little slower, etc. Parsing multi-digit constants requires the interpreter to perform repeated multiplication by ten. I don't know what the exact tradeoffs are between variables and constants, but short variable names use less space than multi-digit constants. Incidentally, the constant zero may be parsed more quickly if written as a single decimal point (with no digits) than written as a digit zero.
