Getting length of /proc/self/exe symlink - linux

As mentioned on SO, readlink on /proc/self/exe can be used to get the executable's path on Linux. man 2 readlink recommends using lstat to find the required buffer length for a path. However, when I stat /proc/self/exe, the st_size member is set to 0. How can I get the length to allocate the buffer?

Taken from man 2 lstat, under NOTES:
For most files under the /proc directory, stat() does not return the file size in the st_size field; instead the field is returned with the value 0.
That's why it does not work.

In practice, I would tend to use a reasonable size (e.g. 256 or 1024, or PATH_MAX) for readlink of /proc/*/exe (or /proc/self/exe).
The point is that almost always, executables are meant to be started by humans, so either the PATH (for execvp(3) or some shell) or the entire file path is human friendly. I don't know of anyone who deliberately uses very long executable filenames (ones that don't fit in the width of a terminal screen), and I have never heard of an executable program (or script) whose filename exceeds a hundred bytes.
So just use a local buffer of some reasonable size (and perhaps strdup it on success if needed). readlink(2) returns the number of meaningful bytes it put in the buffer, so if you really care, grow the buffer in a loop until the result fits, as in the sketch below.
For readlink of /proc/self/exe, I would do it into a 256-byte buffer at initialization, and abort (with a meaningful error message) if it does not fit (or if it fails, e.g. because /proc is not mounted).
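A minimal sketch of that grow-and-retry approach (the 256-byte starting size is just the guess suggested above; nothing in the readlink(2) API mandates it):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t size = 256;                  /* initial guess, as suggested above */
    char *buf = NULL;

    for (;;) {
        char *tmp = realloc(buf, size);
        if (!tmp) { free(buf); perror("realloc"); return 1; }
        buf = tmp;

        /* readlink(2) does not NUL-terminate, so keep one byte spare */
        ssize_t n = readlink("/proc/self/exe", buf, size - 1);
        if (n < 0) { free(buf); perror("readlink"); return 1; }
        if ((size_t)n < size - 1) {     /* it fit, with room to spare */
            buf[n] = '\0';
            break;
        }
        size *= 2;                      /* possibly truncated: retry with a bigger buffer */
    }

    printf("executable: %s\n", buf);
    free(buf);
    return 0;
}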

Related

In order to obtain the size of an ELF binary, what's the difference between size and ls?

The test is on 32-bit x86 Linux.
In order to get the size of some ELF binaries, I tried these two commands:
ls -la sha512sum
size sha512sum
But the thing is that the sizes they report are different:
ls -la sha512sum
-rwxrwxr-x 1 szw175 szw175 95856 Oct 10 07:50 sha512sum
size sha512sum
text data bss dec hex filename
89644 488 452 90584 161d8 sha512sum
So my question is, in order to evaluate the size of an ELF binary, which method is more reliable? Why are these two methods different?
size(1) tells you the sizes of the various sections within the file. ls(1) tells you the number of bytes the ELF file contains. They serve completely different purposes, and which one is more "reliable" depends completely upon what you're going to do with the file.
You can think of an ELF file as containing the various pieces of information needed for loading and linking the program at runtime. Everything given as input (.text, .data, .bss, .rel.*, etc.) is stored in the various sections of the ELF file, and these sections are described by a section-header table stored somewhere in the binary.
You get the size of the contents of the sections with
size sha512sum
but if you want the total size of the file (which includes, along with the section contents, the ELF header, the program-header table, and the section-header table), then you will use
ls -la sha512sum
Note that when the program is loaded (at some address, namely the base address), the contents of the sections are mapped at various offsets from that base address. The mapping may not be contiguous, and the runtime image of the program may be larger than the file size. Also note that some sections (like .bss, which contains only zeros) are not even stored in the file: the program loader maps a memory region and fills it with zeros instead of copying zeros from the file. This saves a lot of disk space and reduces the file size (and thus the time to load the binary into memory).
So, the memory-image size of the program could be larger than the file-size.
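As a rough illustration of the difference, here is a sketch in C that sums the sizes of the sections that actually occupy bytes in the file and compares the total with the file size from stat(2). It assumes a 64-bit ELF (the question's binary is 32-bit, where the Elf32_* types would be needed instead) and skips ELF magic validation for brevity:

#include <elf.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "sha512sum";
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;                       /* the ELF header sits at offset 0 */
    if (fread(&eh, sizeof eh, 1, f) != 1) { perror("fread"); fclose(f); return 1; }

    unsigned long long total = 0;
    for (int i = 0; i < eh.e_shnum; i++) {
        Elf64_Shdr sh;
        if (fseek(f, (long)(eh.e_shoff + (Elf64_Off)i * eh.e_shentsize), SEEK_SET) != 0)
            break;
        if (fread(&sh, sizeof sh, 1, f) != 1)
            break;
        if (sh.sh_type != SHT_NOBITS)    /* .bss takes no bytes in the file */
            total += sh.sh_size;
    }
    fclose(f);

    struct stat st;
    if (stat(path, &st) == 0)
        printf("section contents: %llu bytes, whole file: %lld bytes\n",
               total, (long long)st.st_size);
    return 0;
}

The gap between the two numbers is roughly the ELF header plus the program-header and section-header tables (and any alignment padding).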

Linux read operations requesting duplicate bytes?

This is a bit of a strange question. I'm writing a FUSE module using the go-fuse library, and at the moment I have a "fake" file with a size of 6000 bytes which will output some unrelated data for all read requests. My read function looks like this:
func (f *MyFile) Read(buf []byte, off int64) (fuse.ReadResult, fuse.Status) {
    log.Printf("Reading into buffer of len %d from %d\n", len(buf), off)
    FillBuffer(buf, uint64(off), f.secret)
    return fuse.ReadResultData(buf), fuse.OK
}
As you can see I'm outputting a log on every read containing the range of the read request. The weird thing is that when I cat the file I get the following:
2013/09/13 21:09:03 Reading into buffer of len 4096 from 0
2013/09/13 21:09:03 Reading into buffer of len 8192 from 0
So cat is apparently reading the first 4096 bytes of data, discarding them, then reading 8192 bytes, which encompasses all the data and so succeeds. I've tried other programs too, including hexdump and vim, and they all do the same thing. Interestingly, if I do a head -c 3000 dir/fakefile it still does the two reads, even though the latter is completely unnecessary. Does anyone have any insights into why this might be happening?
I suggest you strace your cat process to see for yourself. On my system, cat reads in 64K chunks and does a final read() to make sure it has read the whole file. That last read() is necessary to distinguish between reading a "chunk-sized" file and a bigger one, i.e. it makes sure there is nothing left to read, since the file size could have changed between the fstat() and the read() system calls.
Is your "fake file" size being returned correctly to FUSE by stat/fstat() system calls?
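For reference, this is the pattern strace typically shows for cat-like tools: keep issuing fixed-size read() calls until one returns 0 (EOF). Even when the first read already returned the whole file, one extra read() is needed to learn that nothing is left. A minimal sketch (the 64K chunk size and the dir/fakefile path are just illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("dir/fakefile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[65536];                     /* 64K chunks, like cat on many systems */
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);   /* pass the data through */
    /* the loop only ends after a read() returning 0 (EOF) or -1 (error) */

    close(fd);
    return n < 0 ? 1 : 0;
}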

How can I show the size of files in /proc? It should not be size zero

From the following output we know that there are two characters in the file /proc/sys/net/ipv4/ip_forward, so why does ls show this file as size zero?
I know this is not a file on disk but a file in memory; is there any command with which I can see the real size of the files in /proc?
root@OpenWrt:/proc/sys/net/ipv4# cat ip_forward | wc -c
2
root@OpenWrt:/proc/sys/net/ipv4# ls -l ip_forward
-rw-r--r-- 1 root root 0 Sep 3 00:20 ip_forward
root@OpenWrt:/proc/sys/net/ipv4# pwd
/proc/sys/net/ipv4
Those are not really files on disk (as you mention) but they are also not files in memory - the names in /proc correspond to calls into the running kernel in the operating system, and the contents are generated on the fly.
The system doesn't know how large the files would be without generating them, but if you read the "file" twice there's no guarantee you get the same data because the system may have changed.
You might be looking for the program
sysctl -a
instead.
Things in /proc are not really files. In most cases, they're not even files in memory. When you access these files, the proc filesystem driver performs a system call that gets data appropriate for the file, and then formats it for output. This is usually dynamic data that's constructed on the fly. An example of this is /proc/net/arp, which contains the current ARP cache.
Getting the size of these things can only be done by formatting the entire output, so it's not done just when listing the file. If you want the sizes, use wc -c as you did.
The /proc/ filesystem is an "illusion" maintained by the kernel, which does not bother giving the size of (most of) its pseudo-files (since computing that "real" size would usually involve having built the entire textual pseudo-file's content), and expects most [pseudo-] textual files from /proc/ to be read in sequence from first to last byte (i.e. till EOF), in reasonably sized (e.g. 1K) blocks. See proc(5) man page for details.
So there is no way to get the true size (of a file like /proc/self/maps or /proc/sys/net/ipv4/ip_forward) in a single syscall such as stat(2): it would give a size of 0, as reported by the stat(1) or ls(1) commands. A typical way of reading these textual files might be:
#include <stdio.h>
#include <string.h>

FILE* f = fopen("/proc/self/maps", "r");
// or some other textual /proc file,
// e.g. /proc/sys/net/ipv4/ip_forward
if (f)
{
    do {
        // you could use getline(3) instead of fgets
        char line[256];
        memset(line, 0, sizeof(line));
        if (NULL == fgets(line, sizeof(line), f))
            break;
        // do something with line, for example:
        fputs(line, stdout);
    } while (!feof(f));
    fclose(f);
}
Of course, some files (e.g. /proc/self/cmdline) are documented as possibly containing NUL bytes. You'll need some fread for them.
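A minimal sketch of that, assuming we want the NUL-separated arguments out of /proc/self/cmdline (the 4096-byte buffer is an arbitrary choice; a real program would grow it or loop):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/cmdline", "rb");
    if (!f) { perror("fopen"); return 1; }

    char buf[4096];
    size_t n = fread(buf, 1, sizeof buf, f);   /* a short read at EOF is fine */
    fclose(f);

    for (size_t i = 0; i < n; ) {
        size_t len = strnlen(&buf[i], n - i);  /* each argument is NUL-terminated */
        printf("arg: %.*s\n", (int)len, &buf[i]);
        i += len + 1;                          /* skip the NUL separator */
    }
    return 0;
}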
It's not really a file in memory; it's an interface between user space and the kernel.

How to create a large file on a VFAT partition efficiently in embedded Linux

I'm trying to create a large empty file on a VFAT partition by using the `dd' command on an embedded Linux box:
dd if=/dev/zero of=/mnt/flash/file bs=1M count=1 seek=1023
The intention was to skip the first 1023 blocks and write only 1 block at the end of the file, which should be very quick on a native EXT3 partition, and it indeed is. However, this operation turned out to be quite slow on a VFAT partition, along with the following message:
lowmem_shrink:: nr_to_scan=128, gfp_mask=d0, other_free=6971, min_adj=16
// ... more `lowmem_shrink' messages
Another attempt was to fopen() a file on the VFAT partition and then fseek() to the end to write the data, which has also proved slow, along with the same messages from the kernel.
So basically, is there a quick way to create the file on the VFAT partition (without traversing the first 1023 blocks)?
Thanks.
Why are VFAT "skipping" writes so slow?
Unless the VFAT filesystem driver is made to "cheat" in this respect, creating large files on FAT-type filesystems will always take a long time. To comply with the FAT specification, the driver has to allocate all data blocks and zero-initialize them, even if you "skip" the writes. That's because of the "cluster chaining" FAT does.
The reason for that behaviour is FAT's inability to support either:
UN*X-style "holes" in files (aka "sparse files")
that's what you're creating on ext3 with your testcase: a file with no data blocks allocated to the first 1 GB - 1 MB of it, and a single 1 MB chunk of actually committed, zero-initialized blocks at the end.
NTFS-style "valid data length" information.
On NTFS, a file can have uninitialized blocks allocated to it, but the file's metadata will keep two size fields - one for the total size of the file, another for the number of bytes actually written to it (from the beginning of the file).
Without a specification supporting either technique, the filesystem would always have to allocate and zerofill all "intermediate" data blocks if you skip a range.
Also remember that on ext3, the technique you used does not actually allocate blocks to the file (apart from the last 1MB). If you require the blocks preallocated (not just the size of the file set large), you'll have to perform a full write there as well.
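For comparison, here is a sketch of what the dd invocation above amounts to in C: seek past a 1 GiB - 1 MiB gap and write a single 1 MiB block. On ext3 the skipped range is just a hole; on VFAT the intermediate clusters must be allocated and zero-filled, which is exactly the slowness being discussed. The path mirrors the question, and on a 32-bit embedded build you may also want _FILE_OFFSET_BITS=64:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char block[1024 * 1024];     /* 1 MiB of zeros (static arrays are zero-initialized) */

    /* skip the first 1023 MiB; on ext3 this merely records a hole */
    if (lseek(fd, 1023LL * 1024 * 1024, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        close(fd);
        return 1;
    }
    if (write(fd, block, sizeof block) != (ssize_t)sizeof block)
        perror("write");

    close(fd);
    return 0;
}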
How could the VFAT driver be modified to deal with this?
At the moment, the driver uses the Linux kernel function cont_write_begin() to start even an asynchronous write to a file; this function looks like:
/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned flags,
                     struct page **pagep, void **fsdata,
                     get_block_t *get_block, loff_t *bytes)
{
        struct inode *inode = mapping->host;
        unsigned blocksize = 1 << inode->i_blkbits;
        unsigned zerofrom;
        int err;

        err = cont_expand_zero(file, mapping, pos, bytes);
        if (err)
                return err;

        zerofrom = *bytes & ~PAGE_CACHE_MASK;
        if (pos+len > *bytes && zerofrom & (blocksize-1)) {
                *bytes |= (blocksize-1);
                (*bytes)++;
        }

        return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
That is a simple strategy, but also a pagecache trasher (your log messages are a consequence of the call to cont_expand_zero(), which does all the work and is not asynchronous). If the filesystem were to split the two operations - one task to do the "real" write, and another to do the zero filling - it would appear snappier.
The way this could be achieved while still using the default Linux filesystem utility interfaces would be to internally create two "virtual" files - one for the to-be-zerofilled area, and another for the actually-to-be-written data. The real file's directory entry and FAT cluster chain would only be updated once the background task is complete, by linking its last cluster to the first cluster of the "zerofill file" and the last cluster of that one to the first cluster of the "actual write file". One would also want to use a direct-I/O write for the zerofilling, in order to avoid trashing the pagecache.
Note: While all this is technically possible for sure, the question is how worthwhile it would be to do such a change. Who needs this operation all the time? What would the side effects be?
The existing (simple) code is perfectly acceptable for smaller skipping writes; you won't really notice its presence if you create a 1MB file and write a single byte at the end. It'll bite you only if you go for file sizes on the order of the limits of what the FAT filesystem allows you to do.
Other options ...
In some situations, the task at hand involves two (or more) steps:
freshly format (e.g.) an SD card with FAT
put one or more big files onto it to "pre-fill" the card
(app-dependent, optional) pre-populate the files, or put a loopback filesystem image into them
In one of the cases I've worked on, we folded the first two - i.e. modified mkdosfs to pre-allocate / pre-create files when making the (FAT32) filesystem. That's pretty simple: when writing the FAT tables, just create allocated cluster chains instead of clusters filled with the "free" marker. It also has the advantage that the data blocks are guaranteed to be contiguous, in case your app benefits from this. And you can decide to make mkdosfs not clear the previous contents of the data blocks. If you know, for example, that one of your preparation steps involves writing the entire data anyway, or doing ext3-in-file-on-FAT (a pretty common thing - Linux appliance, SD card for data exchange with a Windows app/GUI), then there's no need to zero out anything / double-write (once with zeroes, once with whatever else). If your use case fits this (i.e. formatting the card is a useful / normal step of the "initialize it for use" process anyway), then try it out; a suitably modified mkdosfs is part of TomTom's dosfsutils sources, see mkdosfs.c and search for the -N command line option handling.
When talking about preallocation, as mentioned, there's also posix_fallocate(). Currently on Linux with FAT, this will do essentially the same as a manual dd ..., i.e. wait for the zerofill. But the specification of the function doesn't mandate that it be synchronous. The block allocation (FAT cluster chain generation) would have to be done synchronously, but the VFAT on-disk dirent size update and the data block zerofills could be backgrounded / delayed (i.e. either done at low priority in the background, or only done if explicitly requested via fdatasync() / sync(), so that the app can e.g. allocate blocks and then write the contents with non-zeroes itself ...). That's technique / design though; I'm not aware of anyone having done that kernel modification yet, if only for experimenting.
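A minimal sketch of that preallocation call, reusing the path and the 1 GiB size from the dd example (note that posix_fallocate(3) returns an error number instead of setting errno, and that on VFAT today it still ends up zero-filling, as described above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/flash/file", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* reserve 1 GiB starting at offset 0 */
    int err = posix_fallocate(fd, 0, 1024LL * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return err ? 1 : 0;
}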

Doing file operations with 64-bit addresses in C + MinGW32

I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler (Cygwin/MinGW32) uses, or get it to have fopen64?
The ftell() function returns a long, which on 32-bit systems can only represent offsets up to 2^31 - 1 bytes (about 2 GB). So you can't get the file offset within a 24 GB file to fit into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.
You might try using the OS-provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64-bit value.
Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.
I don't know of any way to do this in one file. It's a bit of a hack, but if splitting the file up properly isn't a real option, you could write a few functions that temporarily split the file: one that uses ftell() to move through the file and switches to a new file when it reaches the split point, and another that stitches the files back together before exiting. An absolutely botched-up approach, but if no better solution comes to light it could be a way to get the job done.
I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, _lseeki64, read, write. And I am able to write and seek in > 4 GB files.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty to anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?
Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.
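A minimal sketch of seeking in a > 4 GB file with those 64-bit CRT functions; huge.xml is a placeholder name, and it assumes a MinGW/MSVC build on Windows where _fseeki64() and _ftelli64() are available:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("huge.xml", "rb");
    if (!f) { perror("fopen"); return 1; }

    /* jump 10 GB into the file; a plain fseek()/ftell() long would overflow here */
    if (_fseeki64(f, 10LL * 1024 * 1024 * 1024, SEEK_SET) != 0) {
        perror("_fseeki64");
        fclose(f);
        return 1;
    }

    char buf[4096];
    size_t n = fread(buf, 1, sizeof buf, f);
    long long pos = _ftelli64(f);
    /* %I64d is the msvcrt format for 64-bit integers */
    printf("read %lu bytes, now at offset %I64d\n", (unsigned long)n, pos);

    fclose(f);
    return 0;
}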
