determining the optimal buffer size for file read in linux

I am writing a C program which reads from stdin and writes to stdout, but it buffers the data so that a write is performed only after it has read a specific number of bytes (= SIZE):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE 100

int main()
{
    char buf[SIZE];
    int n = 0;
    //printf("Block size = %d\n", BUFSIZ);
    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, n);
    exit(0);
}
I am running this program on Ubuntu 18.04 hosted on Oracle VirtualBox (4 GB RAM, 2 cores), and testing the program for different values of the buffer size. I have redirected standard input to come from a file (filled with dynamically created random data) and standard output to go to /dev/null. Here is the shell script used to run the test:
#!/bin/bash
# $1 - step size (bytes)
# $2 - start size (bytes)
# $3 - stop size (bytes)
echo "Changing buffer size from $2 to $3 in steps of $1, and measuring time for copying."
buff_size=$2
echo "Test Data" >testData
echo "Step Size:(doubles from previous size) Start Size:$2 Stop Size:$3" >>testData
while [ $buff_size -le $3 ]
do
    echo "" >>testData
    echo -n "$buff_size," >>testData
    gcc -DSIZE=$buff_size copy.c                      # Compile the copy program with the new buffer size
    dd bs=1000 count=1000000 </dev/urandom >testFile  # Create testFile with ~1 GB of random data
    (/usr/bin/time -f "\t%U, \t%S," ./a.out <testFile 1>/dev/null) 2>>testData
    buff_size=$(($buff_size * 2))
    rm -f a.out
    rm -f testFile
done
I am measuring the time taken to execute the program and tabulating it; the columns below are buffer size in bytes, user time, and system time (both in seconds). A test run produces the following data:
Test Data
Step Size:(doubles from previous size) Start Size:1 Stop Size:524288
1, 5.94, 17.81,
2, 5.53, 18.37,
4, 5.35, 18.37,
8, 5.58, 18.78,
16, 5.45, 18.96,
32, 5.96, 19.81,
64, 5.60, 18.64,
128, 5.62, 17.94,
256, 5.37, 18.33,
512, 5.70, 18.45,
1024, 5.43, 17.45,
2048, 5.22, 17.95,
4096, 5.57, 18.14,
8192, 5.88, 17.39,
16384, 5.39, 18.64,
32768, 5.27, 17.78,
65536, 5.22, 17.77,
131072, 5.52, 17.70,
262144, 5.60, 17.40,
524288, 5.96, 17.99,
I don't see any significant variation in user+system time as the block size changes. But theoretically, as the block size becomes smaller, many more system calls are issued for the same file size, and it should take more time to execute. I have seen test results in the book 'Advanced Programming in the UNIX Environment' by Richard Stevens for a similar test, which show that user+system time drops significantly if the buffer size used in the copy is close to the block size (in my case, the block size is 4096 bytes on an ext4 partition).
Why am I not able to reproduce these results? Am I missing some factors in these tests?

You did not remove the line #define SIZE 100 from your source code, so the definition passed via the option (-DSIZE=...) only has an effect above that #define; below it, the in-source value 100 wins. My compiler prints a warning for this at compile time (<command-line>:0:0: note: this is the location of the previous definition).
If you comment out the #define (or guard it, see the sketch below), the command-line value will take effect.
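For example, a guarded version of the define (a minimal sketch, not part of the original program) lets -DSIZE=... take effect while keeping 100 as the default:

/* copy.c sketch: only define SIZE when -DSIZE=... was not given */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef SIZE
#define SIZE 100   /* default, used only without -DSIZE=... */
#endif

int main(void)
{
    char buf[SIZE];
    ssize_t n;

    while ((n = read(0, buf, sizeof(buf))) > 0)
        write(1, buf, n);
    exit(0);
}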
Another aspect that comes to mind:
If you create a file on a machine and read it right away, it will still be in the OS's disk cache (which is large enough to hold the whole file), so the actual disk block size won't have much influence here.
Stevens's book was written in 1992, when RAM was far more expensive than today, so some information in there may be outdated. I doubt that newer editions of the book have removed these tests, though, because in general the points they make are still true.
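If you want the block size to matter despite the cache, one option is to evict the test file from the page cache before each timed run. Below is a minimal sketch (not part of the original test harness): posix_fadvise() with POSIX_FADV_DONTNEED asks the kernel to drop the cached pages; the alternative is echo 3 > /proc/sys/vm/drop_caches, as used elsewhere on this page.

/* drop_cache.c - sketch: evict a file's pages from the page cache */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    fdatasync(fd);   /* write out any dirty pages first */
    /* Advise the kernel that we no longer need the cached pages (offset 0, whole file). */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    close(fd);
    return 0;
}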

Related

psutil vs dd: monitoring disk I/O

I'm writing y.a.t. (yet-another-tool :)) for monitoring disk usage on Linux.
I'm using python 3.3.2 and psutil 3.3.0.
The process I'm monitoring does something really basic: I use the dd tool and I vary the block size (128, 512, 1024, 4096)
#!/bin/bash
dd if=./bigfile.txt of=./copy.img bs=4096
bigfile.txt:
$ stat bigfile.txt
File: ‘bigfile.txt’
Size: 87851423 Blocks: 171600 IO Block: 4096 regular file
And the snippet of the monitor is as follows:
def poll(interval, proc):
    d_before = proc.io_counters()
    time.sleep(interval)
    tst = time.time()
    d_after = proc.io_counters()
    usage = OrderedDict.fromkeys(d_after.__dict__.keys())
    for k, v in usage.items():
        usage[k] = d_after.__dict__[k] - d_before.__dict__[k]
    return tst, usage
At each run, I clear the cache (as suggested many times on stackoverflow):
rm copy.img && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
My question is: why aren't the numbers matching?
bs=128:
dd:
686339+1 records in
686339+1 records out
87851423 bytes (88 MB) copied, 1.21664 s, 72.2 MB/s
monitor.py:
1450778750.104943 OrderedDict([('read_count', 686352), ('write_count', 686343), ('read_bytes', 87920640), ('write_bytes', 87855104)])
bs=4096
dd:
21448+1 records in
21448+1 records out
87851423 bytes (88 MB) copied, 0.223911 s, 392 MB/s
monitor.py:
1450779294.5541275 OrderedDict([('read_count', 21468), ('write_count', 21452), ('read_bytes', 88252416), ('write_bytes', 87855104)])
The difference is still there with all the values of bs.
Is it a matter of certain reads/writes not being counted? Does psutil perform some extra work? For example, with bs=4096, why does psutil report 400993 more bytes for reads and 3681 more for writes?
Am I missing something big?
Thanks a lot.
EDIT: as an update, the granularity of the timing in the measurement, i.e. the time.sleep(interval) call, does not matter. I tried different values and summed up the total number of reads and writes reported by psutil; the difference remains.
EDIT2: typo in snippet code
write_bytes
The read_bytes and write_bytes correspond to the same fields from /proc/<PID>/io. Quoting the documentation (emphasis mine):
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems.
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
As you know, most (all?) filesystems are block-based. This implies that if you have a program that, say, writes just 5 bytes to a file, and your block size is 4 KiB, then 4 KiB will be written.
If you don't trust dd, let's try with a simple Python script:
with open('something', 'wb') as f:
    f.write(b'12345')
input('press Enter to exit')
This script should write only 5 bytes, but if we inspect /proc/<PID>/io, we can see that 4 KiB were written:
$ cat /proc/3455/io
rchar: 215317
wchar: 24
syscr: 66
syscw: 2
read_bytes: 0
write_bytes: 4096
cancelled_write_bytes: 0
This is the same thing that is happening with dd in your case.
You have asked dd to write 87851423 bytes. How many 4 KiB blocks are 87851423 bytes?
87851423 - (87851423 mod 4096) + 4096 = 87855104
Not by chance, 87855104 is exactly the number reported by psutil.
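The same rounding can be written as a small helper (just a sketch mirroring the arithmetic above; 4096 is the IO Block size from the stat output):

#include <stdio.h>

/* Round a byte count up to the next multiple of the filesystem block size. */
static unsigned long long round_up_to_block(unsigned long long bytes,
                                            unsigned long long block)
{
    return ((bytes + block - 1) / block) * block;
}

int main(void)
{
    /* 87851423 requested bytes -> 87855104 bytes actually written in 4 KiB blocks */
    printf("%llu\n", round_up_to_block(87851423ULL, 4096ULL));
    return 0;
}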
read_bytes
How about read_bytes? In theory we should have read_bytes equal to write_bytes, but actually read_bytes shows 16 more blocks in the first run, and 97 more blocks in the second run.
Well, first of all, let's see what files dd is actually reading:
$ strace -e trace=open,read -- dd if=/dev/zero of=zero bs=1M count=2
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\v\2\0\0\0\0\0"..., 832) = 832
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/dev/zero", O_RDONLY) = 3
open("zero", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0
read(0, "# Locale name alias data base.\n#"..., 4096) = 2570
read(0, "", 4096) = 0
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 0
+++ exited with 0 +++
As you can see, dd is opening and reading the linker, the GNU C library, and locale files. It is reading more bytes than you can see above, because it's also using mmap, not just read.
The point is: dd reads many more files than just the source file, so it is expected that read_bytes is somewhat higher than write_bytes. But why is it inconsistent?
Those files read by dd are also used by many other programs. Even if you drop_caches just before executing dd, chances are that some other process will reload one of these files into memory. You can try this with a very simple C program:
int main()
{
    while (1) {
    }
}
Compiled with the default GCC options, this program does nothing except load the dynamic linker and the GNU C library. If you drop_caches, execute the program and cat /proc/<PID>/io more than once, you'll see that read_bytes varies across runs (unless you perform the steps very quickly, in which case the probability that some other program has reloaded those files into the cache is low).
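To watch the counters from inside a process, a minimal sketch (Linux-specific; it just dumps the process's own /proc/self/io) can be run repeatedly after drop_caches to see read_bytes differ between runs:

#include <stdio.h>

/* Print this process's own I/O accounting from the Linux /proc interface. */
int main(void)
{
    char line[256];
    FILE *io = fopen("/proc/self/io", "r");
    if (io == NULL) {
        perror("fopen /proc/self/io");
        return 1;
    }
    while (fgets(line, sizeof(line), io) != NULL)
        fputs(line, stdout);   /* rchar, wchar, read_bytes, write_bytes, ... */
    fclose(io);
    return 0;
}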

Mapped region still valid when size of underlying file changes?

Let's have a look at a few scenarios:
a)
file size : |---------|
mapped region: |---------|
region access: |XXXXXXXXX|
--> file grows
file size : |----------------|
mapped region: |---------|
region access: |XXXXXXXXX|
Is it still well-defined/portable/safe to access (read/write) the complete mapped region?
(assuming that the file grew via normal writes to it or via truncating it; file was just mapped once, no extra remapping after the file size changed)
b)
file size : |---------|
mapped region: |-----------------------|
access : |XXXXXXXXX|
--> file grows
file size : |-----------------------|
mapped region: |-----------------------|
access : |XXXXXXXXXXXXXXXXXXXXXXX|
Say, before the file was extended the program just accessed the intersection of the file size and the mapped region. This should be fine.
After the file grew - such that the sizes of the mapping and file match - is it now well defined to access every part of the region/file?
If this is the case, creating larger mapped regions in the beginning could be an optimization to avoid some mremap (or munmap/mmap) calls - at least for some use-cases.
c)
file size : |---------|
mapped region: |---------|
access : |XXXXXXXXX|
--> file is truncated
file size : |---|
mapped region: |---------|
access : |XXX|
As long as the program accesses the still overlapping part of the region - is that well-defined behaviour?
Generally, if the size of a mapped file changes, it is safe to access pages not affected by the size change, and it is unspecified what happens to pages that are affected by it.
From mmap(2):
1.
If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified.
2.
The mmap() function can be used to map a region of memory that is larger than the current size of the object. Memory access within the mapping but beyond the current end of the underlying objects may result in SIGBUS signals being sent to the process.
So, in all three cases, it seems that it's safe to access all originally mapped pages below current file size and it's not safe to access pages above current file size.
I'm not totally sure about case (b), but it seems to be a valid case and it works at least in Linux.
Note that SIGBUS generation is not guaranteed, and it is not specified what really happens when you access data beyond the mapping size or beyond the file size. An implementation may allow you to read valid data up to the end of the last page, for example.
There are also two optimization tricks related to question (to avoid mremap()):
You can reserve a large memory region up front and then subsequently mmap() the pages you need into it using the MAP_FIXED flag (a sketch follows below).
You can use the Linux-specific remap_file_pages(2) call.
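A minimal sketch of the first trick: reserve a large region of address space with an anonymous PROT_NONE mapping, then map file pages into it with MAP_FIXED. The file name "foo" and the sizes are just illustrative, borrowed from the test program below.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t reserve = 1UL << 30;   /* reserve 1 GiB of address space */
    size_t mapped  = 4096;        /* map only the first page for now */

    /* Reserve address space without backing it with the file yet. */
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap reserve"); return 1; }

    int fd = open("foo", O_RDONLY);   /* "foo" as created by the test program below */
    if (fd < 0) { perror("open"); return 1; }

    /* Map file pages over the start of the reservation; MAP_FIXED replaces the
       placeholder mapping, so later pages can be mapped the same way without
       any munmap/mremap of the reservation. */
    if (mmap(base, mapped, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        perror("mmap file");
        return 1;
    }
    printf("%d\n", base[0]);
    return 0;
}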
Test program
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

int main(int argc, char** argv) {
    int mmap_size  = atoi(argv[1]);
    int file_size1 = atoi(argv[2]);
    int file_size2 = atoi(argv[3]);

    char* data = malloc(max(file_size1, file_size2));
    memset(data, 7, max(file_size1, file_size2));

    int fd = open("foo", O_RDWR | O_TRUNC | O_CREAT, 0777);
    write(fd, data, file_size1);

    char* addr = mmap(NULL, mmap_size, PROT_READ, MAP_SHARED, fd, 0);

    if (file_size2 <= file_size1)
        ftruncate(fd, file_size2);
    else
        write(fd, data, file_size2 - file_size1);

    printf("%d\n", addr[0]);
    printf("%d\n", addr[file_size1 - 1]);
    printf("%d\n", addr[file_size2 - 1]);
    return 0;
}
Example output on Linux:
$ ./a.out 4096 4096 $(( 4096 * 2))
7
7
0
$ ./a.out $(( 4096 * 2 )) 4096 $(( 4096 * 2))
7
7
7
$ ./a.out $(( 4096 * 2 )) $(( 4096 * 2)) 4096
7
Bus error
1) The file grows after having been mapped
If you know that the file will grow, you would map it with the matching flag so that the mapping grows with the file.
If you do not know whether the file will grow, you would also not access anything beyond the mapped area.
2) The file shrinks after having been mapped
If you know that the file has shrunk, there is no reason to access the area past the end of the file, as you would get a signal.
If you do not know that the file has shrunk, see my other answer to this question.

Accessing large memory (32 GB) using /dev/zero

I want to use /dev/zero for storing lots of temporary data (32 GB or around that). I am doing this:
fd = open("/dev/zero", O_RDWR );
// <Exit on error>
vbase = (uint64_t*) mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
// <Exit on error>
ftruncate(fd, (off_t) MEMSIZE);
I am changing MEMSIZE from 1 GB to 32 GB (performing a memtest) to see if I can really access all of that range. I run out of memory at 1 GB.
Is there something I am missing? Am I mmap'ing correctly?
Or am I running into some system limit? How can I check whether this is happening?
P.S.: I run many programs that generate many gigs of data within a single file, so I don't know if there is an artificial upper limit, just that I seem to be running into something.
I have to admit I'm confused about what you're actually trying to do. Anyway, a couple of reasons why what you are doing might not work:
From the mmap(2) manpage: "MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored;"
From the null(4) manpage: "Data written to a null or zero special file is discarded."
So anyway, before MAP_ANONYMOUS existed, mmap'ing /dev/zero was sometimes used to get anonymous (i.e. not backed by any file) memory. There is no need to do both. In either case, actually writing to all that memory implies that you need some kind of backing store for it, either physical memory or swap space. If you cannot guarantee that, maybe it's better to mmap() a real file on a filesystem with enough space?
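A sketch of that last suggestion: size a real scratch file first, then map it MAP_SHARED. The path and size are placeholders; it assumes a 64-bit system (and -D_FILE_OFFSET_BITS=64 on 32-bit ones), and a filesystem with enough free space.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (32ULL * 1024 * 1024 * 1024)   /* 32 GiB, adjust to taste */

int main(void)
{
    /* Back the mapping with a real file so there is actual storage behind it. */
    int fd = open("./bigscratch", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Give the file its full size before mapping it. */
    if (ftruncate(fd, (off_t)MEMSIZE) != 0) { perror("ftruncate"); return 1; }

    uint64_t *vbase = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    if (vbase == MAP_FAILED) { perror("mmap"); return 1; }

    vbase[0] = 42;                                 /* touch the first page */
    vbase[MEMSIZE / sizeof(uint64_t) - 1] = 42;    /* ... and the last one */

    munmap(vbase, MEMSIZE);
    close(fd);
    unlink("./bigscratch");                        /* scratch data, remove afterwards */
    return 0;
}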
Look into the Linux kernel mmap implementation; the call chain is:
vm_mmap → vm_mmap_pgoff → do_mmap_pgoff → mmap_region → file->f_op->mmap(file, vma)
In the function do_mmap_pgoff, it checks the max_map_count
if (mm->map_count > sysctl_max_map_count)
    return -ENOMEM;
root> sysctl -a | grep map_count
vm.max_map_count = 65530
In the function mmap_region, it checks the process virtual address limit (whether it is unlimited).
int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
    unsigned long cur = mm->total_vm; /* pages */
    unsigned long lim;

    lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;

    if (cur + npages > lim)
        return 0;
    return 1;
}
root> ulimit -a | grep virtual
virtual memory (kbytes, -v) unlimited
In the Linux kernel, the init task has this rlimit set by default:
[RLIMIT_AS] = { RLIM_INFINITY, RLIM_INFINITY }, \
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
#endif
To prove it, use the test_mem program (source below):
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    int ret = getrlimit(RLIMIT_AS, &rl);  /* RLIMIT_DATA is queried the same way */
    if (ret == 0) {
        printf("RLIMIT_AS limit got successfully:\n");
        printf("soft_limit=%lld, hard_limit=%lld\n",
               (long long)rl.rlim_cur, (long long)rl.rlim_max);
    }
    return 0;
}
That means that "unlimited" shows up as 0xFFFFFFFF for a 32-bit app on a 64-bit OS. If you change the shell's virtual address limit, it is reflected correctly:
root> ulimit -v 1024000
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=1048576000, hard_limit=1048576000
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
In mmap_region, there is also an accounting check (accountable_mapping):
accountable_mapping → security_vm_enough_memory_mm → cap_vm_enough_memory → __vm_enough_memory → overcommit/swap/admin and user reserve handling
Check these three things (max_map_count, the RLIMIT_AS virtual address limit, and the overcommit accounting) to see whether your mapping can be satisfied.
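For example, a small sketch (standard Linux /proc paths assumed) can dump the relevant knobs in one place:

#include <stdio.h>
#include <sys/resource.h>

static void print_proc(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0)
        printf("RLIMIT_AS: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    print_proc("/proc/sys/vm/max_map_count");     /* per-process mapping count limit */
    print_proc("/proc/sys/vm/overcommit_memory"); /* policy used by __vm_enough_memory */
    return 0;
}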

FUSE fseek unexpected behaviour with direct_io

I'm trying to write a FUSE filesystem that presents streamable music as mp3 files. I don't want to start to stream the audio when just the ID3v1.1 tag is read, so I mount the filesystem with direct_io and max_readahead=0.
But when I do this (which is also what libid3tag does), I get reads of 2752 bytes with offset -2880 bytes from the end:
char tmp[255];
FILE* f = fopen("foo.mp3", "r");
fseek(f, -128, SEEK_END);
fread(tmp, 1, 10, f);
Why is this? I expect to get a call to read with an offset exactly 128 bytes from the end and a size of 10.
The number of bytes read seems to vary somewhat.
I've had a similar issue and filed an issue with s3fs. Check out the issue: http://code.google.com/p/s3fs/issues/detail?can=2&q=&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&groupby=&sort=&id=241
Additionally, check out line 1611 in s3fs.cpp:
http://code.google.com/p/s3fs/source/browse/trunk/src/s3fs.cpp?r=316
// error check this
// fseek (pSourceFile , 0 , SEEK_END);

Tools to reduce risk regarding password security and HDD slack space

Down at the bottom of this essay is a comment about a spooky way to beat passwords. Scan the entire HDD of a user including dead space, swap space etc, and just try everything that looks like it might be a password.
The question, part 1: are there any tools around (a live CD, for instance) that will scan an unmounted file system and zero everything that can be zeroed? (Note that I'm not trying to find passwords.)
This would include:
Slack space that is not part of any file
Unused parts of the last block used by a file
Swap space
Hibernation files
Dead space inside of some types of binary files (like .DOC)
The tool (aside from the last case) would not modify anything that can be detected via the file system API. I'm not looking for a block device find/replace but rather something that just scrubs everything that isn't part of a file.
part 2, How practical would such a program be? How hard would it be to write? How common is it for file formats to contain uninitialized data?
One (risky and costly) way to do this would be to use a file system aware backup tool (one that only copies the actual data) to back up the whole disk, wipe it clean and then restore it.
I don't understand your first question (do you want to modify the file system? Why? Isn't this dead space exactly where you want to look?)
Anyway, here's an example of such a tool:
#include <stdio.h>
#include <alloca.h>
#include <string.h>
#include <ctype.h>

/* Number of bytes we read at once, >2*maxlen */
#define BUFSIZE (1024*1024)

/* Replace this with a function that tests the password consisting of the first len bytes of pw */
int testPassword(const char* pw, int len) {
    /*char* buf = alloca(len+1);
    memcpy(buf, pw, len);
    buf[len] = '\0';
    printf("Testing %s\n", buf);*/
    int rightLen = strlen("secret");
    return len == rightLen && memcmp(pw, "secret", len) == 0;
}

int main(int argc, char* argv[]) {
    int minlen = 5; /* We know the password is at least 5 characters long */
    int maxlen = 7; /* ... and at most 7. Modify to find longer ones */
    int avlen = 0;  /* available length - the number of bytes we already tested and think could belong to a password */
    int i;
    char* curstart;
    char* curp;
    FILE* f;
    size_t bytes_read;
    char* buf = alloca(BUFSIZE+maxlen);

    if (argc != 2) {
        printf("Usage: %s disk-file\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (f == NULL) {
        printf("Couldn't open %s\n", argv[1]);
        return 2;
    }

    for (;;) {
        /* Copy the rest of the buffer to the front */
        memcpy(buf, buf+BUFSIZE, maxlen);
        bytes_read = fread(buf+maxlen, 1, BUFSIZE, f);
        if (bytes_read == 0) {
            /* Read the whole file */
            break;
        }
        for (curstart = buf; curstart < buf+bytes_read;) {
            for (curp = curstart+avlen; curp < curstart + maxlen; curp++) {
                /* Let's assume the password just contains letters and digits. Use isprint() otherwise. */
                if (!isalnum((unsigned char)*curp)) {
                    curstart = curp + 1;
                    break;
                }
            }
            avlen = curp - curstart;
            if (avlen < minlen) {
                /* Nothing to test here, move along */
                curstart = curp+1;
                avlen = 0;
                continue;
            }
            for (i = minlen; i <= avlen; i++) {
                if (testPassword(curstart, i)) {
                    char* found = alloca(i+1);
                    memcpy(found, curstart, i);
                    found[i] = '\0';
                    printf("Found password: %s\n", found);
                }
            }
            avlen--;
            curstart++;
        }
    }
    fclose(f);
    return 0;
}
Installation:
Start a Linux Live CD
Copy the program to the file hddpass.c in your home directory
Open a terminal and type the following
su || sudo -s # Makes you root so that you can access the HDD
apt-get install -y gcc # Install gcc
(This works only on Debian/Ubuntu et al.; check your system documentation for others.)
gcc -o hddpass hddpass.c # Compile.
./hddpass /dev/YOURDISK # The disk is usually sda, hda on older systems
Look at the output
Test (copy to console, as root):
gcc -o hddpass hddpass.c
</dev/zero head -c 10000000 >testdisk # Create an empty 10MB file
mkfs.ext2 -F testdisk # Create a file system
rm -rf mountpoint; mkdir -p mountpoint
mount -o loop testdisk mountpoint # needs root rights
</dev/urandom head -c 5000000 >mountpoint/f # Write stuff to the disk
echo asddsasecretads >> mountpoint/f # Write the password into our test file
# On some file systems, you could even remove the file.
umount mountpoint
./hddpass testdisk # prints secret
Test it yourself on an Ubuntu Live CD:
# Start a console and type:
wget http://phihag.de/2009/so/hddpass-testscript.sh
sh hddpass-testscript.sh
Therefore, it's relatively easy. As I found out myself, ext2 (the file system I used) overwrites deleted files. However, I'm pretty sure some file systems don't. Same goes for the pagefile.
How common is it for file formats to contain uninitialized data?
Less and less common, I would have thought. The classic "offender" is older versions of MS Office applications that (essentially) did a memory dump to disk as their "quicksave" format: no serialisation, no selection of what to dump, and a memory allocator that doesn't zero newly allocated memory pages. That led not only to juicy things from previous versions of the document (so the user could use undo), but also to juicy snippets from other applications.
How hard would it be to write?
Something that clears out unallocated disk blocks shouldn't be that hard. It'd need to run either off-line or as a kernel module, so as not to interfere with normal file-system operations, but most file systems have an "allocated"/"not allocated" structure that is fairly straightforward to parse. Swap is harder, but as long as you're OK with having it cleared on boot (or shutdown), it's not too tricky. Clearing out the tail block is trickier, definitely not something I'd want to try on-line, but it shouldn't be TOO hard to make it work for off-line cleaning.
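For the unallocated-blocks part, a crude on-line alternative is to fill the free space of the mounted filesystem with zeros and then delete the filler file; previously freed blocks then contain zeros. A hedged sketch (the filler path is whatever you choose on the target filesystem; it does not touch tail blocks or swap):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* Usage: ./zerofree /mountpoint/fillerfile
       Writes zeros until the filesystem is full, then removes the filler. */
    if (argc != 2) {
        fprintf(stderr, "Usage: %s filler-path\n", argv[0]);
        return 1;
    }

    static char zeros[1024 * 1024];   /* 1 MiB of zeros per write, zero-initialized */
    int fd = open(argv[1], O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    while ((n = write(fd, zeros, sizeof(zeros))) > 0)
        ;                             /* keep writing until write() fails (typically ENOSPC) */

    fsync(fd);                        /* make sure the zeros reach the disk */
    close(fd);
    unlink(argv[1]);                  /* free the blocks again */
    return 0;
}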
How practical would such a program be?
Depends on your threat model, really. I'd say that at one end it'd not give you much at all, but at the other end it's a definite help in keeping information out of the wrong hands. I can't give a hard and fast answer, though.
Well, if I was going to code it for a boot CD, I'd do something like this:
File is 101 bytes but takes up a 4096-byte cluster.
Copy the file "A" to "B" which has nulls added to the end.
Delete "A" and overwrite it's (now unused) cluster.
Create "A" again and use the contents of "B" without the tail (remember the length).
Delete "B" and overwrite it.
Not very efficient, and it would need a tweak to make sure you don't try to copy the first (and therefore full) clusters in a file. Otherwise, you'll run into slowness and failure if there's not enough free space.
Are there open-source tools that do this efficiently?
