Getting the root device in a kernel module - linux

I did some web searches for this, but could only find results about getting the kernel module associated with a device node. Is there any way I can get the major and minor numbers of the current system's root device and, if applicable, the root device's parent device (e.g., /dev/sda is the "parent" of /dev/sda2)? Does the kernel export some functions for getting this, or would I need to get it indirectly?

There is no module associated with a device node. Keep in mind that the root directory is per-process state: each process stores a reference to the inode of its root directory (which the privileged chroot(2) system call can change) and to its current working directory (used to resolve paths that do not begin with /).
If you want to know which device holds the root directory, you have two options:
Your process has not called chroot(2): call stat(2) on the "/" directory (or open it and call fstat(2) on the descriptor). The st_dev field of the returned struct stat is the device on which the root directory resides. It is a dev_t number in which some of the bits represent the major number and some the minor number. In kernel code, the MKDEV(ma,mi), MAJOR(dev) and MINOR(dev) macros defined in <linux/kdev_t.h> build and split it; in user space, use makedev(), major() and minor() from <sys/sysmacros.h>. For sd-style disks, where each disk owns 16 consecutive minor numbers, masking the minor number with 0xf0 gives the minor number of the whole disk; other drivers may lay out their minor numbers differently.
Your process has called chroot(2), so it cannot reach the real root directory of the system. If you have access to the /proc filesystem, you can probably get the mount table (for example with the mount(1) command or by reading /proc/mounts), search it for the / entry, and take the /dev/sd<disk> entry from it. Once you have the device, getting the parent device is easy: mask the minor number as in the previous option to get the minor number of the physical disk.
You can also read the /proc/diskstats file, which shows the statistics of each block device. The first three fields of each line are the major number, the minor number and the device name.
NOTE
Some disk arrangements, such as RAID devices or volume-manager disks, don't allow partitioning. In those cases, getting to the physical disk (or disks, as there can be more than one) is more difficult.

Related

Immutable names in /dev/disk

There are four entries in /dev/disk which I am interested in.
by-id
by-label
by-path
by-uuid
Which of the entries contain immutable names for physical drives? By immutable, I mean that the name shouldn't change if I
change the usb/pci port used to connect to the drive.
destroy and create partitions (GPT).
move from one computer to another (external hard-drive).
For example, /dev/sda can change to /dev/sdb if a different flash drive is connected. But the UUID stays the same. I don't mind if a partition's path changes (I think the UUID changes if you destroy and then recreate a partition), but the complete physical drive must stay at the same location (/dev/sdX may change, but the UUID doesn't when the usb port is changed).
Please suggest relevant tags.
Edit -
Can you say the same for partlabel and partuuid?
In short: you can use by-label or by-uuid to keep names immutable.
In detail:
Disk names (/dev/sdX) are assigned by the kernel according to the controller the disk is attached to and the order in which disks are detected. If you move a disk from one USB port to another, to the kernel it looks like a switch to a different controller. This is why a name can change from /dev/sda to /dev/sdb.
The by-label and by-uuid entries in /dev/disk relate to the filesystem located on the disk. Label and UUID are filesystem attributes that are set when the filesystem is created and can be changed afterwards.
They do not depend on where the disk is attached, so they survive:
disk migration from one computer to another.
disk migration from one controller to another on the same computer.
However, by-label and by-uuid entries will not survive if you destroy the partition, but the same label or UUID can be assigned when the filesystem is recreated, so the newly created filesystem will be mounted at the same mount point.
I personally prefer by-label, as it is supported by many filesystems and is short and descriptive.
More information about persistent block device naming.

Port Window Api---GetVolumeInformation to Linux

win Api
WINAPI GetVolumeInformation(
_In_opt_ LPCTSTR lpRootPathName,
_Out_opt_ LPTSTR lpVolumeNameBuffer,
_In_ DWORD nVolumeNameSize,
_Out_opt_ LPDWORD lpVolumeSerialNumber,
_Out_opt_ LPDWORD lpMaximumComponentLength,
_Out_opt_ LPDWORD lpFileSystemFlags,
_Out_opt_ LPTSTR lpFileSystemNameBuffer,
_In_ DWORD nFileSystemNameSize
);
Hello: I want to port the Windows API function GetVolumeInformation to Linux.
Q1: Does Linux have an equivalent function?
Q2: If not:
Q2.1 What corresponds to lpVolumeNameBuffer in Linux (is it /dev/sda1)? How can I get it in Linux?
Q2.2 What corresponds to lpVolumeSerialNumber in Linux (is )? I use ioctl to get it:
struct hd_driveid id;ioctl(fd, HDIO_GET_IDENTITY, &id);
Q2.3 What corresponds to lpMaximumComponentLength in Linux? How can I get it in Linux?
Q2.4 What corresponds to lpFileSystemFlags in Linux? How can I get it in Linux?
Q2.5 What corresponds to lpFileSystemNameBuffer? How can I get it in Linux?
If you have any good ideas I would really appreciate it.
Thanks!
Conceptually, linux doesn't have volumes in the same way that Windows does - linux has mount points, Windows has 'drive letters', for lack of a better term.
lpVolumeName is the friendly name of a mounted volume - for instance, my C: drive is labeled 'main_disk'. The point of this label is only to give a friendly name to the drive, and the label can change whenever the user decides to change it, and it does not affect the structure of the ultimate file system layout.
In linux, volumes are mounted as mount points, eg, the device referred to as /dev/sda2 might be mounted at the /var mount point. Here, /var is part of the file system, and thus, mount points determine the structure of the ultimate file system layout. This is not simply a friendly name for the user to give to their disk so that they can know what it is.
Linux and others do support something called disk labeling, but that's used to be able to refer to the disk using a stable name, instead of its device name, which could change if the hard drives are moved around in the computer. For instance, in FreeBSD, I could label my main hard drive root; when that hard drive is detected during boot up, i can instead refer to it as /dev/label/root and specify the mount point for it using that name. However, this is still being used to determine the ultimate structure of the file system - it's a functional dependency - and so the user can't change it willy-nilly without breaking something or having to change the fstab file that describes device-to-mount-point mappings.
lpVolumeSerialNumber has to do with the filesystem on the volume; that is, this field is specific to what filesystem is being used on the volume, and isn't something that all volumes will have.
In Windows, typically two filesystems are supported - Fat32 and NTFS - both of which can give a serial number to a file system. In Linux, FreeBSD, etc, there are many, many file systems - UFS/UFS2, EXT/EXT2/EXT3/EXT4, ReiserFS/Reiser4, BTRFS, ZFS, FFS, etc. Whether or not a volume has a serial number depends on what file system the volume is using, and not all filesystems support serial numbers. Each filesystem will have its own utility commands to query this sort of data - for instance, dumpfs on FreeBSD for UFS2 file systems.
The list goes on. Unfortunately, there are no direct analogs between Windows and Linux for the parts you're asking about, but as I've shown, sometimes it doesn't matter and when it does, you can usually find something to replace it with.
POSIX systems have statvfs()/fstatvfs() subroutines (which might be a library function or system call, depending on the OS). They are perhaps the closest match to what the Windows function you named does, but they have a very different interface.
I don't know if it matters to you, but a related problem is to enumerate mounted filesystems. To enumerate currently mounted filesystems on Linux, you can read the contents of /proc/mounts. Some other UNIX flavors (namely BSD-derived systems) have getvfsstat()/getfsstat() calls for the same purpose. Solaris has (or used to have) neither, and your best choice was to read /etc/mnttab. AIX has none of these, and your (only?) reliable (?) option to enumerate current mounts is to parse the output of the mount command, run without any argument.
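On Linux with glibc, /proc/mounts can be parsed with setmntent()/getmntent() from <mntent.h> instead of by hand. The sketch below looks up the filesystem type of a mount point, which is roughly what lpFileSystemNameBuffer reports on Windows; fs_type_of_mount is a hypothetical helper name, and the code assumes /proc is mounted.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Look up the filesystem type (e.g. "ext4") mounted at `mount_point`
 * by scanning /proc/mounts. Returns 0 on success, -1 if not found. */
int fs_type_of_mount(const char *mount_point, char *out, size_t out_size)
{
    FILE *mounts;
    struct mntent *entry;
    int found = -1;

    assert(mount_point != NULL && out != NULL && out_size > 0);
    mounts = setmntent("/proc/mounts", "r");
    if (mounts == NULL)
        return -1;
    /* Keep scanning to the end: a later entry for the same directory
     * shadows an earlier one, so the last match is the visible mount. */
    while ((entry = getmntent(mounts)) != NULL) {
        if (strcmp(entry->mnt_dir, mount_point) == 0 &&
            strlen(entry->mnt_type) < out_size) {
            strcpy(out, entry->mnt_type);
            found = 0;
        }
    }
    endmntent(mounts);
    return found;
}
```

The same loop also exposes entry->mnt_fsname (the device) and entry->mnt_opts (the mount options) for each mounted filesystem.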

Can inode and crtime be used as a unique file identifier?

I have a file indexing database on Linux. Currently I use file path as an identifier.
But if a file is moved/renamed, its path is changed and I cannot match my DB record to the new file and have to delete/recreate the record. Even worse, if a directory is moved/renamed, then I have to delete/recreate records for all files and nested directories.
I would like to use inode number as a unique file identifier, but inode number can be reused if file is deleted and another file created.
So, I wonder whether I can use a pair of {inode,crtime} as a unique file identifier.
I hope to use i_crtime on ext4 and creation_time on NTFS.
In my limited testing (with ext4) inode and crtime do, indeed, remain unchanged when renaming or moving files or directories within the same file system.
So, the question is whether there are cases when inode or crtime of a file may change.
For example, can fsck, defragmentation or partition resizing change the inode or crtime of a file?
Interesting that
http://msdn.microsoft.com/en-us/library/aa363788%28VS.85%29.aspx says:
"In the NTFS file system, a file keeps the same file ID until it is deleted."
but also:
"In some cases, the file ID for a file can change over time."
So, what are those cases they mentioned?
Note that I studied similar questions:
How to determine the uniqueness of a file in linux?
Executing 'mv A B': Will the 'inode' be changed?
Best approach to detecting a move or rename to a file in Linux?
but they do not answer my question.
{device_nr,inode_nr} are a unique identifier for an inode within a system
moving a file to a different directory does not change its inode_nr
the linux inotify interface enables you to monitor changes to inodes (either files or directories)
Extra notes:
moving files across filesystems is handled differently (it is in fact copy+delete)
networked filesystems (or a mounted NTFS) cannot always guarantee the stability of inode numbers
Microsoft is not a Unix vendor, its documentation does not cover Unix or its filesystems, and should be ignored (except for NTFS's internals)
Extra text: the old Unix adage "everything is a file" should in fact be: "everything is an inode". The inode carries all the meta-information about a file (or directory, or a special file) except the name. The filename is in fact only a directory entry that happens to link to the particular inode. Moving a file implies creating a new link to the same inode and deleting the old directory entry that linked to it.
The inode metadata can be obtained with the stat(), fstat() and lstat() system calls.
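The points above can be checked with a small experiment: record {st_dev, st_ino} with stat(), rename the file within the same filesystem, and compare. This is a sketch only; struct file_id and the helper names are invented for the example, and it assumes /tmp is writable.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical identifier: device number plus inode number. */
struct file_id { dev_t dev; ino_t ino; };

/* Fill `id` for `path`. Returns 0 on success, -1 if stat(2) fails. */
int file_identity(const char *path, struct file_id *id)
{
    struct stat st;

    assert(path != NULL && id != NULL);
    if (stat(path, &st) != 0)
        return -1;
    id->dev = st.st_dev;
    id->ino = st.st_ino;
    return 0;
}

int same_file(const struct file_id *a, const struct file_id *b)
{
    return a->dev == b->dev && a->ino == b->ino;
}

/* Create a temporary file, rename it within the same directory and
 * compare identities. Returns 1 if unchanged, 0 if changed, -1 on error. */
int identity_survives_rename(void)
{
    char old_path[] = "/tmp/file_id_demo_XXXXXX";
    char new_path[sizeof old_path + 8];
    struct file_id before, after;
    int result = -1;
    int fd = mkstemp(old_path);

    if (fd < 0)
        return -1;
    close(fd);
    snprintf(new_path, sizeof new_path, "%s.moved", old_path);
    if (file_identity(old_path, &before) == 0 &&
        rename(old_path, new_path) == 0) {
        if (file_identity(new_path, &after) == 0)
            result = same_file(&before, &after);
        unlink(new_path);
    } else {
        unlink(old_path);
    }
    return result;
}
```

This only demonstrates stability across a rename; it says nothing about reuse of the inode number after the file is deleted.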
The allocation and management of i-nodes in Unix is dependent upon the filesystem. So, for each filesystem, the answer may vary.
For the Ext3 filesystem (the most popular), i-nodes are reused and thus cannot be used as a unique file identifier, nor does reuse occur according to any predictable pattern.
In Ext3, i-nodes are tracked in a bit vector, each bit representing a single i-node number. When an i-node is freed, its bit is set to zero. When a new i-node is needed, the bit vector is searched for the first zero-bit and the i-node number (which may have been previously allocated to another file) is reused.
This may lead to the naive conclusion that the lowest numbered available i-node will be the one reused. However, the Ext3 file system is complex and highly optimised, so no assumptions should be made about when and how i-node numbers can be reused, even though they clearly will.
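To make the reuse concrete, here is a toy bit-vector allocator. It is a deliberately simplified model with invented names and sizes, not the Ext3 code: the real allocator also picks a block group first, so the number it hands out is not simply the lowest free one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INODE_COUNT 64  /* hypothetical size of the toy inode table */

struct toy_inode_bitmap { uint8_t bits[TOY_INODE_COUNT / 8]; };

/* Find the first zero bit, mark it used and return its index.
 * Returns -1 when every number is taken. */
int toy_inode_alloc(struct toy_inode_bitmap *map)
{
    assert(map != NULL);
    for (size_t i = 0; i < TOY_INODE_COUNT; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if ((map->bits[i / 8] & mask) == 0) {
            map->bits[i / 8] |= mask;
            return (int)i;
        }
    }
    return -1;
}

/* Clear the bit so the number can be handed out again. */
void toy_inode_free(struct toy_inode_bitmap *map, int ino)
{
    assert(map != NULL && ino >= 0 && ino < TOY_INODE_COUNT);
    map->bits[ino / 8] &= (uint8_t)~(1u << (ino % 8));
}
```

Allocating three numbers, freeing the second and allocating again returns the freed number, which now refers to a different "file".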
From the source code for ialloc.c, where i-nodes are allocated:
There are two policies for allocating an inode. If the new inode is a
directory, then a forward search is made for a block group with both
free space and a low directory-to-inode ratio; if that fails, then of
the groups with above-average free space, that group with the fewest
directories already is chosen. For other inodes, search forward from
the parent directory's block group to find a free inode.
The source code that manages this for Ext3 is called ialloc and the definitive version is here: https://github.com/torvalds/linux/blob/master/fs/ext3/ialloc.c
I guess the DB application would need to consider the case where the file is restored from backup, which would preserve the file crtime, but not the inode number.

Getting disk sector size without raw filesystem permission

I'm trying to get the sector size, specifically so I can correctly size the buffer for reading/writing with O_DIRECT.
The following code works when my app's run as root:
int fd = open("/dev/xvda1", O_RDONLY|O_NONBLOCK);
int blockSize; /* BLKSSZGET stores an int, not a size_t */
int rc = ioctl(fd, BLKSSZGET, &blockSize);
How can I get the sector size without it being run as root?
According to the Linux manpage for open():
In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
So it looks like you may be able to obtain this information using xfsctl()... if you are using xfs.
Since your underlying block device is a Xen virtual block device and there might be any number of layers below that (LVM, dm-crypt, another filesystem, etc...) I'm not sure how meaningful all of this will really be for you.
You could use stat(2) or a related syscall (perhaps on some particular file) and read the st_blksize field. However, this gives a filesystem-related block size, not the sector size preferred by the hardware. But for O_DIRECT input (from a file on a filesystem!) st_blksize might be more relevant.
Otherwise, I would suggest a power-of-two size, perhaps 8Kbytes or 64Kbytes, as the size of your O_DIRECT-ed reads (and you may want to align your read buffer to the page size, usually 4Kbytes).
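A sketch of both suggestions, usable without root: read st_blksize with stat(), and allocate a page-aligned buffer with posix_memalign(). The helper names are made up for illustration, and whether page alignment satisfies O_DIRECT on a given filesystem is an assumption, since the restrictions vary as quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preferred I/O block size of the filesystem holding `path`
 * (st_blksize). Returns -1 on error. */
long preferred_io_block_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}

/* Allocate `size` bytes aligned to the page size.
 * The caller releases the buffer with free(). Returns NULL on failure. */
void *alloc_page_aligned(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page <= 0 || posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;
    return buf;
}
```

A read size such as 64 * 1024 is a multiple of both common sector sizes and the usual page size, so the buffer length and the buffer address are aligned together.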

Linux procfs inode number changed while process was running

I'm working on security software(SW) for Linux.
One thing our SW does is that, when some process is started, it stat()s the process's /proc/ entry and remembers the entry's inode number.
When the SW later needs to ascertain that the process is still running (and hasn't been restarted), it looks up the process's inode again and compares it to the one remembered.
All was fine and dandy until recently, when I began receiving false alerts for a specific application - Opera browser 11.10beta.
Basically it appears that while Opera was running, the inode number for its /proc/PID entry has changed, which we considered an impossibility.
This is a rather big spanner in the works of the SW's security concept - so much relied on the fact that while a process is running, its /proc/ entry's inode remains unchanged.
Could someone please advise as to why such behaviour may be exhibited.
Thanks.
+1 for the defensive programming habits.
Disclaimer
In case it isn't obvious: I'm just brainstorming along here. It is clear we cannot just give the answer instantaneously, and my thoughts didn't fit in a comment; I will delete this if it doesn't lead to a solution
I'd certainly make sure that Opera hasn't forked/exec-ed itself (sorry, that probably insults your intelligence :));
Next, have a look at namespaces and chrooting
http://vincent.bernat.im/en/blog/2011-jchroot-isolation.html
http://manpages.ubuntu.com/manpages/oneiric/man1/schroot.1.html
Edit
[patch 08/12] procfs: inode defragmentation support
Edit
I'd say that the process ID must have changed (or procfs remounted, visibly to the user process?):
Under /proc we can find general system information and specific process information and statistics. Linux distinguishes different types of information with the inode number. An inode number in Linux is represented as a 32 bit number and a PID (Process Identifier) is represented as a 16 bit number. With this schema, Linux splits the inode number in two halves of 16 bit. The left half is interpreted as a PID number and the right one is interpreted as a class of information. Since a PID=0 is not valid, Linux uses this value to indicate that inode contains global information. (source)
Thanks to sehe for pointing in the right direction and to Random832 for finally nailing it.
I ran a process and monitored its /proc entry with ls -i /proc/21314. Alas! Every single entry under that directory had its inode number changed after approx. 15 minutes.
So inode numbers were never meant to be permanent in procfs :(
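For reference, the check described in the question boils down to something like the sketch below (proc_entry_inode is a hypothetical name). Given the observation above, the value it returns should be treated as a snapshot that can change while the process is alive, not as a stable identity.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Store the inode number of /proc/<pid> in `ino_out`.
 * Returns 0 on success, -1 if the entry cannot be stat()ed. */
int proc_entry_inode(pid_t pid, ino_t *ino_out)
{
    char path[64];
    struct stat st;

    assert(ino_out != NULL);
    snprintf(path, sizeof path, "/proc/%ld", (long)pid);
    if (stat(path, &st) != 0)
        return -1;
    *ino_out = st.st_ino;
    return 0;
}
```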

Resources