Purpose of ibs/obs/bs in dd - linux

I have a script that creates a file system in a file on a Linux machine. To create the file system, it uses dd with the bs=x option, reading from /dev/zero and writing to a file. I think specifying ibs/obs/bs is usually useful when reading from real hardware devices, which impose specific block-size constraints. In this case, however, since it reads from a virtual device and writes to a plain file, I don't see the point of the bs=x option. Is my understanding wrong here?
(In case it helps: this file system is later used to boot a QEMU VM.)

To understand block sizes, you have to be familiar with tape drives. If you're not interested in tape drives - for example, you don't think you're ever going to use one - then you can go back to sleep now.
Remember the tape drives from films in the 60s, 70s, maybe even 80s? The ones where the reel went spinning around, and so on? Not your Exabyte or even QIC - quarter-inch cartridge - tapes; your good old fashioned reel-to-reel half-inch tape drives? On those, block size mattered.
The data on a tape was written in blocks. Each block was separated from the next by an inter-record gap.
----+-------+-----+-------+-----+----
... | block | IRG | block | IRG | ...
----+-------+-----+-------+-----+----
Depending on the tape drive hardware and software, there were a variety of problems that could happen. For example, if the tape was written with a block size of 5120 bytes and you read the tape with a block size of 512 bytes, then the tape drive might read the first block, return you 512 bytes of it, and then discard the remaining data; the next read would start on the next block. Conversely, if the tape was written with a block size of 512 bytes and you requested blocks of 5120 bytes, you would get short reads; each read would return just 512 bytes, and if your software wasn't paying attention, you'd be reading garbage. There was also the issue that the tape drive had to get up to speed to read the block, and then slow down. The ASCII art suggests that the IRG was smaller than the data blocks; that was not necessarily the case. And it took time to read one block, overshoot the IRG, rewind backwards to get to the next block, and start forwards again. And if the tape drive didn't have the memory to buffer data - the cheaper ones did not - then you could seriously affect your tape drive performance.
War story: work prepared on newer machine with a slightly more modern tape drive. I wrote a tape using tar without a sensible block size (so it defaulted to 512 bytes). It was a large bit of software - must have been, oh, less than 100 MB in total (a long time ago, in other words). The tape wrote nicely because the machine was modern enough, and it took just a few seconds to do so. But, I had to get the material off the tape on a machine with an older tape drive, one that did not have any on-board buffer. So, it read the material, 512 bytes at a time, and the reel rocked forward, reading one block, and then rocked back all but maybe half an inch, and then read forwards to get to the next block, and then rocked back, and ... well, you could see it doing this, and since it took appreciable portions of a second to read each 512 byte block, the total time taken was horrendous. My flight was due to leave...and I needed to get that data across too. (It was long enough ago, and in a land far enough away, that last minute changes to flights weren't much of an option either.) To cut a long story short, it did get read - but if I'd used a sensible block size (such as 5120 bytes instead of the default of 512), I would have been done much, much quicker and with much less danger of missing the plane (but I did actually catch the plane, with maybe 20 minutes to spare, despite a taxi ride across Paris in the rush hour).
With more modern tape drives, there was enough memory on the drive to do buffering and getting a tape drive to stream - write continuously without reversing - was feasible. It used to be that I'd use a block size like 256 KB to get QIC tapes to stream. I've not done much with tape drives recently - let's see, not this millennium and not much for a few years before that, either; certainly not much since CD and DVD became the software distribution mechanisms of choice (when electronic download wasn't used).
But the block size really did matter in the old days. And dd provided good support for it. You could even transfer data from a tape that was written with, say, 4 KB blocks to another that you wanted to write with, say, 16 KB blocks, by specifying the ibs (input block size) separately from the obs (output block size). Darned useful!
Also, the count parameter is in terms of the (input) block size. It was useful to say 'dd bs=1024 count=1024 if=/dev/zero of=/my/file/of/zeroes' to copy 1 MB of zeroes around. Or to copy 1 MB of a file.
The importance of dd is vastly diminished; it was an essential part of the armoury for anybody who worked with tape drives a decade or more ago.

The block size is the number of bytes that are read and written at a time. Presumably there is a count= option, and that is specified in units of the block size. If there is a skip= or seek= option, those will also be in block-size units. However, if you are reading and writing a regular file, and there are no disk errors, then the block size doesn't really matter as long as you can scale those parameters accordingly and they are still integers. That said, certain sizes may be more efficient than others.

For reading from /dev/zero, it doesn't matter. ibs/obs/bs specify how many bytes will be read at a time. It's helpful to choose a number based on the way bytes are read/written in the operating system. For instance, Linux usually reads from a hard drive in 4096 byte chunks. If you have at least some idea about how the underlying hardware reads/writes, it might be a good idea to specify ibs/obs/bs. By the way, if you specify bs, it will override whatever you specify for ibs and obs.
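If it helps to see the same idea outside of dd, here is a rough Go sketch of such a copy loop (the /tmp/zeroes path, the 4 KiB buffer and the 1 MiB total are arbitrary choices for illustration, not anything the original script uses). The buffer handed to each read/write call plays the role of bs, and the loop bound plays the role of count:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    in, err := os.Open("/dev/zero")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    out, err := os.Create("/tmp/zeroes") // illustrative output path
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    buf := make([]byte, 4096)  // plays the role of bs=4096
    for i := 0; i < 256; i++ { // plays the role of count=256: 256 * 4096 bytes = 1 MiB
        if _, err := io.ReadFull(in, buf); err != nil {
            log.Fatal(err)
        }
        if _, err := out.Write(buf); err != nil {
            log.Fatal(err)
        }
    }
}

Each iteration issues one 4096-byte read and one 4096-byte write, which is exactly the knob that bs turns in dd.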

In addition to the great answer by Jonathan Leffler, keep in mind that the bs= option isn't always a substitute for using both ibs= and obs=, in particular in the old [ugly] days of tape drives.
The bs= option reserves the right for dd to write the data as soon as it's read. This can cause you to no longer have identically sized blocks on the output. Here is GNU's take on this, but the behavior dates back as far as I can remember (the 80s):
(bs=) Set both input and output block sizes to bytes. This makes dd read and write bytes per block, overriding any ‘ibs’ and ‘obs’ settings. In addition, if no data-transforming conv option is specified, input is copied to the output as soon as it’s read, even if it is smaller than the block size.
For instance, back in the QIC days on an old Sun system, if you did this:
tar cvf /dev/rst0c /bla
It would work, but cause an enormous amount of back-and-forth thrashing while the drive wrote a small block, then tried to back up and read in order to reposition itself properly for the next write.
If you swapped this with:
tar cvf - /bla | dd ibs=16K obs=16K of=/dev/rst0c
You'd get the QIC drive writing much larger chunks and not thrashing quite so much.
However, if you made the mistake of this:
tar cvf - /bla | dd bs=16K of=/dev/rst0c
You'd run the risk of having precisely the same thrashing you had before, depending upon how much data was available at the time of each read.
Specifying both ibs= and obs= prevents this.

Related

Getting the percentage of used space and used inodes in a mount

I need to calculate the percentage of used space and used inodes for a mount path (e.g. /mnt/mycustommount) in Go.
This is my attempt:
var statFsOutput unix.Statfs_t
err := unix.Statfs(mount_path, &statFsOutput)
if err != nil {
    return err
}
totalBlockCount := statFsOutput.Blocks     // Total data blocks in filesystem
freeSystemBlockCount := statFsOutput.Bfree // Free blocks in filesystem
freeUserBlockCount := statFsOutput.Bavail  // Free blocks available to unprivileged user
Now the proportion I need would be something like this:
x : 100 = (totalBlockCount - free[which?]BlockCount) : totalBlockCount
i.e. x : 100 = usedBlockCount : totalBlockCount. But I don't understand the difference between Bfree and Bavail (what's an 'unprivileged' user got to do with filesystem blocks?).
For inodes my attempt:
totalInodeCount = statFsOutput.Files
freeInodeCount = statFsOutput.Ffree
// so now the proportion is
// x : 100 = (totalInodeCount - freeInodeCount) : totalInodeCount
How to get the percentage for used storage?
And is the inodes calculation I did correct?
Your comment expression isn't valid Go, so I can't really interpret it without guessing. With guessing, I read it as correct, but have I guessed what you actually mean, or merely what I think you mean? In other words, without seeing actual code, I can only imagine what your final code will be, and if the code I imagine isn't the code you actually write, the correctness of the code I imagined is irrelevant.
That aside, I can answer your question here:
(what's an 'unprivileged' user got to do with filesystem blocks?)
The Linux statfs call uses the same fields as 4.4BSD. The default 4.4BSD file system (the one called the "fast file system") uses a blocks-with-fragmentation approach to allocate blocks in a sort of stochastic manner. This allocation process works very well on an empty file system, and continues to work well, without extreme slowdown, on somewhat-full file systems. Computerized modeling of its behavior, however, showed pathological slowdowns (amounting to linear search, more or less) were possible if the block usage exceeded somewhere around 90%.
(Later, analysis of real file systems found that the slowdowns generally did not hit until the block usage exceeded 95%. But the idea of a 10% "reserve" was pretty well established by then.)
Hence, if a then-popular large disk drive of 400 MB [1] gave 10% for inodes and another 10% for reserved blocks, that meant that ordinary users could allocate about 320 MB of file data. At that point the drive was "100% full", but it could go to 111% by using up the remaining blocks. Those blocks were reserved for the super-user, though.
These days, instead of a "super user", one can have a capability that can be granted or revoked. We also no longer use the same file systems, so there may be no difference between bfree and bavail on your system.
[1] Yes, the 400 MB Fujitsu Eagle was a large (in multiple senses: it used a 19-inch rack mount setup) drive back then. People are spoiled today with their multi-terabyte SSDs. 😀
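To make the "how do I get the percentage" part concrete, here is a minimal sketch of the kind of code I would expect (it assumes the golang.org/x/sys/unix package you're already using; the df-style choice of putting Bavail in the denominator, and the name usagePercent, are mine, not anything your code requires):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// usagePercent returns the used-space and used-inode percentages for path.
// For space it mirrors what df(1) reports: the denominator is used + Bavail,
// so the root-reserved blocks (Bfree - Bavail) are left out of the picture.
func usagePercent(path string) (spacePct, inodePct float64, err error) {
    var st unix.Statfs_t
    if err = unix.Statfs(path, &st); err != nil {
        return 0, 0, err
    }

    usedBlocks := st.Blocks - st.Bfree  // blocks currently in use
    userTotal := usedBlocks + st.Bavail // all blocks an unprivileged user can ever reach
    if userTotal > 0 {
        spacePct = 100 * float64(usedBlocks) / float64(userTotal)
    }

    usedInodes := st.Files - st.Ffree
    if st.Files > 0 {
        inodePct = 100 * float64(usedInodes) / float64(st.Files)
    }
    return spacePct, inodePct, nil
}

func main() {
    spacePct, inodePct, err := usagePercent("/mnt/mycustommount")
    if err != nil {
        panic(err)
    }
    fmt.Printf("space used: %.1f%%, inodes used: %.1f%%\n", spacePct, inodePct)
}

If you'd rather report the raw ratio the super-user sees, divide by st.Blocks instead of usedBlocks + st.Bavail; the inode proportion you wrote is fine as-is.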

Is it possible to resize MTD partitions at runtime?

I have a very specific need:
to partially replace the content of a flash and to move MTD partition boundaries.
Current map is:
name        offset     size
u-boot      0x000000   0x040000
u-boot-env  0x040000   0x010000
kernel      0x050000   0x230000
initrd      0x280000   0x170000
scripts     0x3f0000   0x010000
filesystem  0x400000   0xbf0000
firmware    0xff0000   0x010000
While desired output is:
name        offset     size
u-boot      0x000000   0x040000
u-boot-env  0x040000   0x010000
kernel      0x050000   0x230000
filesystem  0x280000   0xd70000
firmware    0xff0000   0x010000
This means collapsing initrd, scripts and filesystem into a single area while leaving the others alone.
The problem is that this should be achieved from the running system (booted with the "old" configuration), and I should rewrite the kernel and the "new" filesystem before rebooting.
The system is an embedded one, so I have little room for maneuver (I do have an SD card, though).
Of course the rewritten kernel will have the "new" configuration written in its DTB.
The problem is the transition.
Note: I have seen this question, but it is very old and has the drawback of needing kernel patches, which I would like to avoid.
NOTE2: this question has been flagged for deletion because "not about programming". I beg to disagree: I need to perform said operation on ~14k devices, most of them already sold to customers, so any workable solution should involve, at the very least, scripting.
NOTE3: if absolutely necessary I can even consider (small) kernel modifications (YES, I have means to update kernel remotely).
I will leave the accepted answer as-is but, for anyone who happens to come here looking for a solution, I want to point out that:
Recent (<4 years old) mtd-utils, coupled with a 4.0+ kernel, support:
Definition of a "master" device (an MTD device representing the full, unpartitioned flash). This is a kernel option.
mtd-utils has a specific mtd-part utility that can add/delete MTD partitions dynamically. NOTE: this utility works IF (and only if) the above is defined in the kernel.
With the above utility it's possible to build multiple, possibly overlapping partitions; use with care!
I have three ideas/suggestions:
Instead of moving the partitions, can you just split the "new" filesystem image into chunks and write them to the corresponding "old" MTD partitions? This way you don't really need to change the MTD partition map. After booting into the new kernel, it will see the new contiguous root filesystem. For a JFFS2 filesystem, it should be fairly straightforward to do using split or dd, flash_erase and nandwrite. Something like:
# WARNING: this script assumes that it runs from tmpfs and the old root filesystem is already unmounted.
# Also, it assumes that your shell has arithmetic evaluation, which handles hex (my busybox 1.29 ash does this).
# assuming newrootfs.img is the image of new rootfs
new_rootfs_img="newrootfs.img"
mtd_initrd="/dev/mtd3"
mtd_initrd_size=0x170000
mtd_scripts="/dev/mtd4"
mtd_scripts_size=0x010000
mtd_filesystem="/dev/mtd5"
mtd_filesystem_size=0xbf0000
# prepare chunks of new filesystem image
bs="0x1000"
# note: using arithmetic evaluation $(()) to convert from hex and do the math.
# dd doesn't handle hex numbers ("dd: invalid number '0x1000'") -- $(()) works this around
dd if="${new_rootfs_img}" of="rootfs_initrd" bs=$(( bs )) count=$(( mtd_initrd_size / bs ))
dd if="${new_rootfs_img}" of="rootfs_scripts" bs=$(( bs )) count=$(( mtd_scripts_size / bs )) skip=$(( mtd_initrd_size / bs ))
dd if="${new_rootfs_img}" of="rootfs_filesystem" bs=$(( bs )) count=$(( mtd_filesystem_size / bs )) skip=$(( ( mtd_initrd_size + mtd_scripts_size ) / bs ))
# there's no going back after this point
flash_eraseall -j "${mtd_initrd}"
flash_eraseall -j "${mtd_scripts}"
flash_eraseall -j "${mtd_filesystem}"
nandwrite -p "${mtd_initrd}" rootfs_initrd
nandwrite -p "${mtd_scripts}" rootfs_scripts
nandwrite -p "${mtd_filesystem}" rootfs_filesystem
# don't forget to update the kernel too
There is kernel support for concatenating MTD devices (which is exactly what you're trying to do). I don't see an easy way to use it, but you could create a kernel module, which concatenates the desired partitions for you into a contiguous MTD device.
In order to combine the 3 MTD partitions into one to write the new filesystem, you could create a dm-linear mapping over the 3 mtdblocks, and then turn it back into an MTD device using block2mtd. (i.e. mtdblock + device mapper linear + block2mtd) But it looks very awkward and I don't know if it'll work well (for say, OOB data).
EDIT1: added a comment explaining use of $(( bs )) -- to convert from hex as dd doesn't handle hex numbers directly (neither coreutils, nor busybox dd).
AFAIK, suggestion 1 in @andrey's answer is wrong.
an mtd partition is made of a sequence of blocks, any of which could be bad or go bad anytime. this is why the simple mtd char abstraction exists: an mtd char device (not the mtdblock one) is read sequentially and skips bad blocks. nandwrite also writes sequentially and skips bad blocks.
an mtd char device sort of acts like:
a single file into which you cannot random access, from which you can only read sequentially from the beginning to the end (or to where you get bored).
a single file into which you cannot random access, to which you can only write sequentially from the beginning (or from an erase block where you previously stopped reading) all the way to the end. (that is, you can truncate and append, but you cannot write mid-file.) to write you need to previously erase all erase blocks from where you start writing to the end of the partition.
this means that the partition size is the maximum theoretical capacity, but typically the capacity will be less due to bad blocks, and can be effectively reduced every time you rewrite the partition. you can never expect to write the full size of an mtd partition.
this is where @andrey's suggestion 1 is wrong: it breaks up the file to be written into max-sized pieces before writing each piece. but you never know beforehand how much data will fit into an mtd partition without actually writing that data.
instead, you typically need to write some data, and you pray there will be enough good blocks to fit it. if at some point there are not, the write fails and the device reached end-of-life. needless to say, the larger the fraction of a partition you need, the higher the likelihood that the write will fail (and when that happens, it typically means that the device is toast).
to actually implement something akin to suggestion 1, you need to start writing into a partition (skipping bad blocks), and when you run out of erase blocks, you continue writing into the next partition, and so on. the point being: you cannot know where the data boundaries will lay until you actually write the data and fill each partition; there is no other way.

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here, I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a preallocated buffer and comparing it with a previously prepared null-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
while f.readinto(data):
    if data != cmp:
        break  # this chunk contains something other than zeroes
Two things need to be taken into account:
The sizes of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last chunk read may be shorter than the buffer; reading the file into the prepared buffer keeps the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick, there is no attempt to cast the bytes to a string (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work either... :)

xentop VBD_RD & VBD_WR output

I'm writing a Perl script that tracks the output of the xentop tool, and I'm not sure what the meaning of VBD_RD & VBD_WR is.
Following http://support.citrix.com/article/CTX127896
VBD_RD number displays read requests
VBD_WR number displays write requests
Does anyone know how read & write requests are measured: bytes, kilobytes, megabytes?
Any ideas?
Thank you
As far as I understand, xentop shows you two different measures each for read and write.
VBD_RD and VBD_WR are request counts: the number of times the block device was accessed. They say nothing about the number of bytes read or written.
The second pair, read (VBD_RSECT) and write (VBD_WSECT), is measured in "sectors". You can find the size of the sectors by using xenstore-ls (https://serverfault.com/questions/153196/xen-find-vbd-id-for-physical-disks) (in my case it was 512).
The size of a sector is given in bytes (http://xen.1045712.n5.nabble.com/xen-3-3-testing-blkif-Clarify-units-for-sector-sized-blkif-request-params-td2620172.html).
So if the VBD_WR value is 2, the VBD_WSECT value is 10, and the sector size is 512, then 10 * 512 bytes were written in two different requests (the block device was accessed twice, but we only know the total, not how many bytes each request wrote). To measure disk I/O you can sample these values periodically and take the difference between consecutive samples.
I suppose the sector size might differ per block device, so it might be worth checking the xenstore-ls output for each domain, but I'm not sure. You can probably define it in the cfg file too.
This is what I found out and understood so far. I hope this helps.
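To make the arithmetic concrete, here is a small sketch of the "sample twice and take the difference" idea (the vbdSample struct and its field names are invented for this illustration; a Perl script would do the equivalent after parsing the xentop columns):

package main

import "fmt"

// vbdSample holds one parsed xentop sample for a single VBD. The struct and
// field names are invented for this sketch; they simply mirror the columns.
type vbdSample struct {
    rdReqs, wrReqs   uint64 // VBD_RD / VBD_WR: cumulative request counts
    rdSects, wrSects uint64 // VBD_RSECT / VBD_WSECT: cumulative sector counts
}

const sectorSize = 512 // bytes; confirm per device, e.g. via xenstore-ls

// throughput converts the difference between two samples taken dt seconds
// apart into read/write bytes per second.
func throughput(prev, cur vbdSample, dt float64) (readBps, writeBps float64) {
    readBps = float64(cur.rdSects-prev.rdSects) * sectorSize / dt
    writeBps = float64(cur.wrSects-prev.wrSects) * sectorSize / dt
    return readBps, writeBps
}

func main() {
    // The example from above: 2 write requests totalling 10 sectors in one second.
    prev := vbdSample{}
    cur := vbdSample{wrReqs: 2, wrSects: 10}
    _, writeBps := throughput(prev, cur, 1.0)
    fmt.Printf("written: %.0f bytes/s (10 sectors * %d bytes)\n", writeBps, sectorSize)
}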

CRC16 collision (2 CRC values of blocks of different size)

The Problem
I have a text file which contains one string per line (line break \r\n). This file is secured using CRC16 in two different ways.
CRC16 of blocks of 4096 bytes
CRC16 of blocks of 32768 bytes
Now I have to modify one of these 4096-byte blocks, so that it (the block)
contains a specific string
does not change the size of the textfile
has the same CRC value as the original block (the same goes for the 32k block that contains this 4k block)
Apart from those limitations, I may make any modifications to the block that are required to fulfil them, as long as the file itself does not break its format. I think it is best to use one of the completely filled 4k blocks, not the last block, which could be really short.
The Question
How should I start to solve this problem? The first thing I would come up with is some kind of brute force, but wouldn't it take extremely long to find changes that leave both CRC values the same? Is there perhaps a mathematical way to solve it?
It should be done in seconds or max. few minutes.
There are mathematical ways to solve this, but I don't know them. I'm proposing a brute-force solution:
A block looks like this:
SSSSSSSMMMMEEEEEEE
Each character represents a byte. S = start bytes, M = bytes you can modify, E = end bytes.
With every byte added to the CRC, it reaches a new internal state. You can reuse the checksum state up to the position you modify: you only need to recalculate the checksum for the modified bytes and everything after them. So calculate the CRC of the S part only once.
You don't even need to recompute the following bytes. You just need to check whether the CRC state is the same or different after the modification you made. If it is the same, the CRC of the entire block will also be the same, since the E part is unchanged. If it is different, the final CRC will be different as well (feeding the same E bytes to two different states can't produce the same result, because the state update is invertible), so abort the trial. So you compute the CRC of just the S+M' part (M' being the modified bytes). If it equals the state of CRC(S+M), you won.
That way you have much less data to go through and a recent desktop or server can do the 2^32 trials required in a few minutes. Use parallelism.
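To illustrate the state-reuse idea in code, here is a rough sketch in Go. It assumes a CRC-16/ARC style update function (reflected polynomial 0xA001, zero initial value); swap in whichever CRC16 variant your file actually uses. The names findCollision and nextCandidate are made up for this sketch, and the enclosing 32K block can be handled the same way with a second cached state and a second target:

package crcspoof

// crc16Update feeds data into a running CRC-16 and returns the new register
// state. CRC-16/ARC (reflected poly 0xA001, initial value 0) is assumed here
// purely for illustration.
func crc16Update(crc uint16, data []byte) uint16 {
    for _, b := range data {
        crc ^= uint16(b)
        for i := 0; i < 8; i++ {
            if crc&1 != 0 {
                crc = crc>>1 ^ 0xA001
            } else {
                crc >>= 1
            }
        }
    }
    return crc
}

// findCollision brute-forces replacements for the modifiable part M of a block
// S+M+E. s is the unchanged prefix and m the original modifiable bytes;
// nextCandidate enumerates legal same-length replacements (containing your
// string, keeping the file format intact) and returns nil when exhausted.
func findCollision(s, m []byte, nextCandidate func() []byte) []byte {
    stateAfterS := crc16Update(0, s)      // computed once, reused in every trial
    target := crc16Update(stateAfterS, m) // register state after the original S+M
    for cand := nextCandidate(); cand != nil; cand = nextCandidate() {
        // Only the candidate bytes are hashed per trial. If the state matches,
        // the unchanged E suffix produces the identical final CRC for the block.
        if crc16Update(stateAfterS, cand) == target {
            return cand
        }
    }
    return nil
}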
Take a look at spoof.c. That will directly solve your problem for the CRC of the 4K block. However you will need to modify the code to solve the problem simultaneously for both the CRC of the 4K block and the CRC of the enclosing 32K block. It is simply a matter of adding more equations to solve. The code is extremely fast, running in O(log(n)) time, where n is the length of the message.
The basic idea is that you will need to solve 32 linear equations over GF(2) in 32 or more unknowns, where each unknown is a bit location that you are permitting to be changed. It is important to provide more than 32 unknowns with which to solve the problem, since if you pick exactly 32, it is not at all unlikely that you will end up with a singular matrix and no solution. The spoof code will automatically find non-singular choices of 32 unknown bit locations out of the > 32 that you provide.
