Determine the number of "logical" bytes read/written in a Linux system

I would like to determine the number of bytes logically read/written by all processes via syscalls such as read() and write(). This is different from the number of bytes actually fetched from the storage layer (displayed by tools like iotop), since it includes, for example, reads that hit the page cache. It also differs in when writes are recognized: the logical write IO happens immediately when the write call is issued, while the actual physical IO may occur some time later, depending on various factors (Linux usually buffers writes and performs the physical IO later).
I know how to do it on a per-process basis (see this question for example), but not how to get the system-wide count.

If you want to use the /proc filesystem for the total counts (and not per-second rates), it is quite easy.
This also works on quite old kernels (tested on a Debian Squeeze 2.6.32 kernel).
# cat /proc/1979/io
rchar: 111195372883082
wchar: 10424431162257
syscr: 130902776102
syscw: 6236420365
read_bytes: 2839822376960
write_bytes: 803408183296
cancelled_write_bytes: 374812672
For a system-wide total, just sum the numbers from all processes. That is only good enough over the short term, however, because when a process dies its statistics are removed from memory; you would need process accounting enabled to preserve them. (A minimal sketch of the summation is shown below.)
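As an illustration (my sketch, not part of the original answer), a small C program that walks /proc and sums the rchar/wchar counters of every process currently alive could look like this; it needs root to read other processes' io files and, as noted above, it misses processes that have already exited:
/* Hypothetical sketch: sum rchar/wchar over all /proc/<pid>/io files. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    if (!proc) { perror("opendir /proc"); return 1; }

    unsigned long long rchar_total = 0, wchar_total = 0, v;
    struct dirent *de;
    char path[300], line[128];

    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                 /* only numeric names are PIDs */
        snprintf(path, sizeof path, "/proc/%s/io", de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                 /* process exited, or no permission */
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "rchar: %llu", &v) == 1)
                rchar_total += v;
            else if (sscanf(line, "wchar: %llu", &v) == 1)
                wchar_total += v;
        }
        fclose(f);
    }
    closedir(proc);

    printf("rchar total: %llu\nwchar total: %llu\n", rchar_total, wchar_total);
    return 0;
}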
The meaning of these fields is documented in the kernel source file Documentation/filesystems/proc.txt:
rchar - I/O counter: chars read
The number of bytes which this task has caused
to be read from storage. This is simply the sum of bytes which this
process passed to read() and pread(). It includes things like tty IO
and it is unaffected by whether or not actual physical disk IO was
required (the read might have been satisfied from pagecache)
wchar - I/O counter: chars written
The number of bytes which this task has
caused, or shall cause to be written to disk. Similar caveats apply
here as with rchar.
syscr - I/O counter: read syscalls
Attempt to count the number of read I/O
operations, i.e. syscalls like read() and pread().
syscw - I/O counter: write syscalls
Attempt to count the number of write I/O
operations, i.e. syscalls like write() and pwrite().
read_bytes - I/O counter: bytes read
Attempt to count the number of bytes which
this process really did cause to be fetched from the storage layer.
Done at the submit_bio() level, so it is accurate for block-backed
filesystems.
write_bytes - I/O counter: bytes written
Attempt to count the number of bytes which
this process caused to be sent to the storage layer. This is done at
page-dirtying time.
cancelled_write_bytes
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it
will have been accounted as having caused 1MB of write. In other
words: The number of bytes which this process caused to not happen, by
truncating pagecache. A task can cause "negative" IO too.
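The rchar/read_bytes distinction described above is easy to observe directly. As an illustration (again my sketch, not from the quoted documentation), the following program reads the same file twice and prints /proc/self/io after each pass; both passes add to rchar, but on a file that was not previously cached only the first pass should add noticeably to read_bytes:
/* Hypothetical sketch: read a file twice and compare rchar vs read_bytes.
 * Best run on a file that is not yet in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void dump_self_io(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/io", "r");
    if (!f) { perror("fopen /proc/self/io"); exit(1); }
    printf("--- %s ---\n", label);
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
}

static void read_whole_file(const char *path)
{
    char buf[65536];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    while (read(fd, buf, sizeof buf) > 0)
        ;                             /* discard the data, we only want the counters */
    close(fd);
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    dump_self_io("start");
    read_whole_file(argv[1]);         /* likely hits the storage layer */
    dump_self_io("after first pass");
    read_whole_file(argv[1]);         /* likely served from the page cache */
    dump_self_io("after second pass");
    return 0;
}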

Here is a SystemTap script that tracks the logical IO. It is based on the script at https://sourceware.org/systemtap/SystemTap_Beginners_Guide/traceiosect.html
#! /usr/bin/env stap
# traceio.stp
# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo#redhat.com>
# Copyright (C) 2009 Kai Meyer <kai#unixlords.com>
# Fixed a bug that allows this to run longer
# And added the humanreadable function
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.
#
global reads, writes

probe vfs.read.return {
    if ($return > 0) {
        reads += $return
    }
}

probe vfs.write.return {
    if ($return > 0) {
        writes += $return
    }
}

function humanreadable(bytes) {
    if (bytes > 1024*1024*1024) {
        return sprintf("%d GiB", bytes/1024/1024/1024)
    } else if (bytes > 1024*1024) {
        return sprintf("%d MiB", bytes/1024/1024)
    } else if (bytes > 1024) {
        return sprintf("%d KiB", bytes/1024)
    } else {
        return sprintf("%d B", bytes)
    }
}

probe timer.s(1) {
    printf("reads: %12s writes: %12s\n", humanreadable(reads), humanreadable(writes))
    # Note we don't zero out reads and writes,
    # so the values are cumulative since the script started.
}

Related

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. At the specified baud rate it should take around 5 seconds, but it takes nearly 10 times as long.
As I'm sending more than the buffer can handle at once, it is sent in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer, I calculate a baud rate nearly 10 times lower than the specified one.
I would expect the transmission to be somewhat slower than optimal, but not by this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tried blocking
if (bus < 0) {
    return;
}

struct termios options;
memset(&options, 0, sizeof options);
if (tcgetattr(bus, &options) != 0) {
    close(bus);
    bus = -1;
    return;
}

cfsetspeed(&options, B230400);
cfmakeraw(&options); // <- also tried this manually; did not make a difference

if (tcsetattr(bus, TCSANOW, &options) != 0) {
    close(bus);
    bus = -1;
    return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength) {
    tcdrain(bus); // <- taking way longer than expected
    int32_t r = write(bus, &data[res], dataLength - res);
    if (r == 0)
        break;
    if (r == -1) {
        break;
    }
    res += r;
}
B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
This might be helpful.
As I'm sending more than the buffer can handle at once, it is sent in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytes written to the buffer, I calculate a baud rate nearly 10 times lower than the specified one.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1). Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2).
In order to achieve the maximum throughput, the transmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a suboptimal manner.
Your termios configuration appears to be incomplete.
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The "ignore modem control lines" (CLOCAL) and "enable receiver" (CREAD) flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not check its return value for an error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.
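By way of illustration only (this sketch is mine, not the answerer's, and the exact termios flags you need depend on your hardware and link), a blocking-mode setup along the lines described above, followed by a single write() loop, might look like this:
/* Hypothetical sketch: blocking raw serial setup plus a single write loop.
 * Flow control is left disabled here; enable CRTSCTS or IXON/IXOFF if your
 * link needs it. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <termios.h>
#include <unistd.h>

int open_serial(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);      /* blocking mode */
    if (fd < 0)
        return -1;

    struct termios options;
    if (tcgetattr(fd, &options) != 0) {
        close(fd);
        return -1;
    }

    cfmakeraw(&options);                        /* raw 8-bit mode, no translation */
    cfsetspeed(&options, B230400);
    options.c_cflag |= (CLOCAL | CREAD);        /* ignore modem lines, enable receiver */
    options.c_cflag &= ~CSTOPB;                 /* one stop bit */
    options.c_cflag &= ~CRTSCTS;                /* no hardware flow control */
    options.c_iflag &= ~(IXON | IXOFF | IXANY); /* no software flow control */
    options.c_cc[VMIN]  = 1;                    /* reads block for at least 1 byte */
    options.c_cc[VTIME] = 0;

    if (tcsetattr(fd, TCSANOW, &options) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* In blocking mode a single write() normally transmits everything, but a
 * short write is still possible, so keep the loop. */
ssize_t send_all(int fd, const uint8_t *data, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t r = write(fd, data + done, len - done);
        if (r < 0)
            return -1;
        done += (size_t)r;
    }
    return (ssize_t)done;
}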

GNU Parallel | pipe command

I am completely new to GNU parallel and I need your advice on running the command below with it:
/home/admin/Gfinal/decoder/decdr.pl --gh --w14b /data/tmp/KRX12/a.bin |
perl /home/admin/decout/decoder/flow.pl >> /data/tmp/decodedgfile/out_1.txt
I will run this command on a list of files (.bin). What is the best (fastest) approach to achieve that using GNU parallel, given that the output of the first part of the command (/home/admin/Gfinal/decoder/decdr.pl --gh --w14b) is very large (> 2 GB)?
Any help would be appreciated.
Will this work:
parallel /home/admin/Gfinal/decoder/decdr.pl --gh --w14b {} '|' perl /home/admin/decout/decoder/flow.pl >> /data/tmp/decodedgfile/out_1.txt ::: /data/tmp/KRX12/*.bin
(If the output from flow.pl is more than your disk I/O can handle, try parallel --compress).
Or maybe:
parallel /home/admin/Gfinal/decoder/decdr.pl --gh --w14b {} '|' perl /home/admin/decout/decoder/flow.pl '>>' /data/tmp/decodedgfile/out_{#}.txt ::: /data/tmp/KRX12/*.bin
It depends on whether you want a single output file or one per input file.
Also spend an hour walking through the tutorial. Your command line will love you for it. man parallel_tutorial
Here are some great videos on GNU parallel:
Ref youtube Part 1: GNU Parallel script processing and execution
Here is a link from the GNU web site for platform specific information.
Ref gnu parallel download information
"Multiple input sources
GNU parallel can take multiple input sources given on the command line. GNU
parallel then generates all combinations of the input sources:
parallel echo ::: A B C ::: D E F
Output (the order may be different):
A D
A E
A F
B D
B E
............
The input sources can be files:
parallel -a abc-file -a def-file echo"
Ref GNU-Parallel-Tutorial
With reference to the pipe, from the pipe(7) man page:
Pipe capacity
A pipe has a limited capacity. If the pipe is full, then a write(2)
will block or fail, depending on whether the O_NONBLOCK flag is set
(see below). Different implementations have different limits for the
pipe capacity. Applications should not rely on a particular
capacity: an application should be designed so that a reading process
consumes data as soon as it is available, so that a writing process
does not remain blocked.
In Linux versions before 2.6.11, the capacity of a pipe was the same
as the system page size (e.g., 4096 bytes on i386). Since Linux
2.6.11, the pipe capacity is 65536 bytes. Since Linux 2.6.35, the
default pipe capacity is 65536 bytes, but the capacity can be queried
and set using the fcntl(2) F_GETPIPE_SZ and F_SETPIPE_SZ operations.
See fcntl(2) for more information.
PIPE_BUF
POSIX.1 says that write(2)s of less than PIPE_BUF bytes must be
atomic: the output data is written to the pipe as a contiguous
sequence. Writes of more than PIPE_BUF bytes may be nonatomic: the
kernel may interleave the data with data written by other processes.
POSIX.1 requires PIPE_BUF to be at least 512 bytes. (On Linux,
PIPE_BUF is 4096 bytes.) The precise semantics depend on whether the
file descriptor is nonblocking (O_NONBLOCK), whether there are
multiple writers to the pipe, and on n, the number of bytes to be
written:
Ref man7.org pipe
You could have a look at the fcntl() F_GETPIPE_SZ and F_SETPIPE_SZ operations for more information; a short sketch follows below.
Ref fcntl
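If you want to experiment with the pipe capacity from your own code (an illustration of mine, not something GNU parallel requires), F_GETPIPE_SZ and F_SETPIPE_SZ can be used roughly like this:
/* Hypothetical sketch: query and enlarge the capacity of a pipe.
 * F_SETPIPE_SZ is Linux-specific (since 2.6.35); the kernel rounds the
 * request up, and sizes above /proc/sys/fs/pipe-max-size need privileges. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) != 0) { perror("pipe"); return 1; }

    int size = fcntl(pfd[1], F_GETPIPE_SZ);
    if (size < 0) { perror("F_GETPIPE_SZ"); return 1; }
    printf("default pipe capacity: %d bytes\n", size);    /* usually 65536 */

    if (fcntl(pfd[1], F_SETPIPE_SZ, 1024 * 1024) < 0)
        perror("F_SETPIPE_SZ");                            /* may be refused */
    else
        printf("new pipe capacity: %d bytes\n", fcntl(pfd[1], F_GETPIPE_SZ));

    close(pfd[0]);
    close(pfd[1]);
    return 0;
}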
All the best

Is overwriting a small file atomic on ext4?

Assume we have a file of FILE_SIZE bytes, and:
FILE_SIZE <= min(page_size, physical_block_size);
file size never changes (i.e. truncate() or append write() are never performed);
file is modified only by completely overwriting its contents using:
pwrite(fd, buf, FILE_SIZE, 0);
Is it guaranteed on ext4 that:
Such writes are atomic with respect to concurrent reads?
Such writes are transactional with respect to a system crash?
(i.e., after a crash the file's contents are entirely from some previous write, and we'll never see a partial write or an empty file)
Is the second guarantee true:
with data=ordered?
with data=journal or alternatively with journaling enabled for a single file?
(using ioctl(fd, EXT4_IOC_SETFLAGS, EXT4_JOURNAL_DATA_FL); see the sketch after this question)
when physical_block_size < FILE_SIZE <= page_size?
I've found a related question which links to a discussion from 2011. However:
I didn't find an explicit answer to my question 2.
If the above is true, I wonder whether it is documented somewhere.
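Regarding the per-file journaling flag mentioned in the question, here is a minimal sketch of mine (not the asker's code) using the generic inode-flag ioctls from <linux/fs.h>, which ext4 honours; note that the ioctl takes a pointer to the flags word and that the other flags should be preserved:
/* Hypothetical sketch: set the "journal data" inode flag on one file,
 * the programmatic equivalent of `chattr +j`. Needs ext3/ext4 and
 * appropriate privileges. */
#include <fcntl.h>
#include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_JOURNAL_DATA_FL */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int enable_data_journaling(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) != 0) {   /* read the current flags */
        perror("FS_IOC_GETFLAGS");
        close(fd);
        return -1;
    }
    attr |= FS_JOURNAL_DATA_FL;                     /* add the journaling bit */
    if (ioctl(fd, FS_IOC_SETFLAGS, &attr) != 0) {
        perror("FS_IOC_SETFLAGS");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}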
From my experiment it was not atomic.
Basically my experiment was to have two processes, one writer and one reader. The writer writes to a file in a loop and the reader reads from the same file.
Writer Process:
char buf[][18] = {
    "xxxxxxxxxxxxxxxx",
    "yyyyyyyyyyyyyyyy"
};
int i = 0;
while (1) {
    pwrite(fd, buf[i], 18, 0);
    i = (i + 1) % 2;    /* alternate between the two patterns */
}
Reader Process:
char readbuf[18];
while (1) {
    pread(fd, readbuf, 18, 0);
    // check whether readbuf matches either buf[0] or buf[1]
}
After a while of running both processes, I could see that the readbuf is either xxxxxxxxxxxxxxxxyy or yyyyyyyyyyyyyyyyxx.
So it definitely shows that the writes are not atomic. In my case, 16-byte writes were always atomic.
The answer was: POSIX doesn't mandate atomicity for writes/reads except for pipes. The 16-byte atomicity I saw is kernel-specific and may change in the future.
Details of the answer in the actual post:
write(2)/read(2) atomicity between processes in linux
I am familiar with filesystem theory in general, not with the implementation of ext4, so take this as an educated guess.
Yes, I believe one-sector reads and writes will be atomic, because:
The link you provided quotes: "Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call."
Disk sector (512 bytes) writes are atomic according to Stephen Tweedie. In a private email conversation with him, he acknowledged that this guarantee is only as good as the hardware.
Ext filesystems overwrite data in place; there is no copy-on-write and no new block allocation.
There is some effort to implement inline data, so that very small files' data can fit in the inode itself. If you only need to store a few bytes, that may have an impact.
Not sure about one page, but it would make little sense in full journaling mode to send less than a page to the journal before committing.

Does RCHAR include READ_BYTES (proc/<pid>/io)?

I read /proc/<pid>/io to measure the IO activity of SQL queries, where <pid> is the PID of the database server. I read the values before and after each query to compute the difference and get the number of bytes the request caused to be read and/or written.
As far as I know, the field READ_BYTES counts actual disk IO, while RCHAR includes more, like reads that could be satisfied by the Linux page cache (see Understanding the counters in /proc/[pid]/io for clarification).
This leads to the assumption that RCHAR should come up with a value equal to or greater than READ_BYTES, but my results contradict this assumption.
I could imagine some minor block or page overhead for the results I get for Infobright ICE (values are MB):
Query           RCHAR       READ_BYTES
tpch_q01.sql    34.44180     34.89453
tpch_q02.sql     2.89191      3.64453
tpch_q03.sql    32.58994     33.19531
tpch_q04.sql    17.78325     18.27344
But I completely fail to understand the IO-counters for MonetDB (values are MB):
Query           RCHAR       READ_BYTES
tpch_q01.sql    0.07501     220.58203
tpch_q02.sql    1.37840      18.16016
tpch_q03.sql    0.08272     162.38281
tpch_q04.sql    0.06604      83.25391
Am I wrong in assuming that RCHAR includes READ_BYTES? Is there a way to trick the kernel's counters that MonetDB could be using? What is going on here?
I might add that I clear the page cache and restart the database server before each query.
I'm on Ubuntu 11.10, running kernel 3.0.0-15-generic.
I can only think of two things:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/proc.txt;hb=HEAD#l1305
1:
read_bytes
----------

I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer.
I read "Caused to be fetched from the storage layer" to include readahead, whatever.
2:
rchar
-----

I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache)
Note that this says nothing about "disk access via memory-mapped files". I think this is the more likely reason: MonetDB probably mmaps its database files and then does everything on them.
I'm not really sure how you could measure the bandwidth used through mmap, because of its nature. (A small demonstration of the effect follows below.)
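To illustrate the mmap theory (my sketch, not part of the original answer), the following program maps a file, faults its pages in, and prints /proc/self/io before and after. On a file that is not already cached, read_bytes should grow by roughly the file size while rchar barely moves, since no read() is issued for the mapped data:
/* Hypothetical sketch: data pulled in through mmap() shows up in read_bytes
 * but not in rchar. Run it on a file that is not yet in the page cache
 * (e.g. after `echo 3 > /proc/sys/vm/drop_caches` as root), otherwise the
 * faults are served from cache and read_bytes stays small too. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void dump_self_io(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/io", "r");
    if (!f) { perror("fopen /proc/self/io"); exit(1); }
    printf("--- %s ---\n", label);
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    dump_self_io("before touching the mapping");

    volatile unsigned long sum = 0;               /* touch every page */
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];

    dump_self_io("after touching the mapping");

    munmap(p, st.st_size);
    close(fd);
    return 0;
}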
You can also read the Linux kernel source file include/linux/task_io_accounting.h:
struct task_io_accounting {
#ifdef CONFIG_TASK_XACCT
        /* bytes read */
        u64 rchar;
        /* bytes written */
        u64 wchar;
        /* # of read syscalls */
        u64 syscr;
        /* # of write syscalls */
        u64 syscw;
#endif /* CONFIG_TASK_XACCT */

#ifdef CONFIG_TASK_IO_ACCOUNTING
        /*
         * The number of bytes which this task has caused to be read from
         * storage.
         */
        u64 read_bytes;

        /*
         * The number of bytes which this task has caused, or shall cause to be
         * written to disk.
         */
        u64 write_bytes;

        /*
         * A task can cause "negative" IO too. If this task truncates some
         * dirty pagecache, some IO which another task has been accounted for
         * (in its write_bytes) will not be happening. We _could_ just
         * subtract that from the truncating task's write_bytes, but there is
         * information loss in doing that.
         */
        u64 cancelled_write_bytes;
#endif /* CONFIG_TASK_IO_ACCOUNTING */
};

What do the counters in /proc/[pid]/io mean?

I'm creating a plugin for Munin to monitor stats of named processes. One of the sources of information would be /proc/[pid]/io. But I have a hard time finding out what the difference is between rchar/wchar and read_bytes/write_bytes.
They are not the same, as they provide different values. What do they represent?
While the proc manpage is woefully behind (and so are most manpages/documentation on anything not relating to cookie-cutter user-space development), this stuff is fortunately documented completely in the Linux kernel source under Documentation/filesystems/proc.rst. Here are the relevant bits:
rchar
-----
I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
pagecache)
wchar
-----
I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written
to disk. Similar caveats apply here as with rchar.
read_bytes
----------
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
be fetched from the storage layer. Done at the submit_bio() level, so it is
accurate for block-backed filesystems. <please add status regarding NFS and
CIFS at a later time>
write_bytes
-----------
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
the storage layer. This is done at page-dirtying time.
