I'm currently facing a strange problem in my distributed application.
The application generally does the following things:
Read and Write Data from an NFSv3 Filesystem
Read and Write Data from a tmpfs Filesystem
One process generates files on tmpfs and another process (or another Java thread, which in the end is a pthread) accesses them
One process generates files on NFSv3 and another process (or another Java thread, which in the end is a pthread) accesses them
Write data to NFSv3 and read the same data from another machine
We discovered many latency problems with NFSv3, but those problems are known: if you write a file on NFS and then try to read it from another machine, it can take up to 90 seconds before the file is available when the stat syscall is executed on the other machine.
So we implemented some retry code to address this issue.
Recently we spotted similar behaviour on tmpfs as well. Since tmpfs is in RAM, I thought that once a write has finished, another thread on the same machine that runs after the end of the write should see the file, but we got an error instead.
So we decided to implement yet another retry block.
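A retry block of the kind described above could look roughly like this. It is a minimal sketch, not the actual code: the function name `wait_for_file`, the 90-second default (taken from the NFS delay mentioned above) and the one-second polling interval are all assumptions.

```shell
# wait_for_file PATH [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Poll until PATH exists; give up once TIMEOUT_SECONDS have elapsed.
# Returns 0 if the file appeared, 1 on timeout.
wait_for_file() {
    wf_path=$1
    wf_timeout=${2:-90}   # assumed default, matching the worst case seen on NFS
    wf_interval=${3:-1}   # assumed polling interval in seconds
    wf_waited=0
    while [ ! -e "$wf_path" ]; do
        if [ "$wf_waited" -ge "$wf_timeout" ]; then
            return 1
        fi
        sleep "$wf_interval"
        wf_waited=$((wf_waited + wf_interval))
    done
    return 0
}

# Example: wait up to 5 seconds for a file written by another process.
# wait_for_file /dev/shm/result.dat 5 || echo "file never appeared" >&2
```

Note that this only waits for the name to appear; it says nothing about whether the writer has finished writing the contents.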
The question is: is tmpfs transactional, i.e. is a file guaranteed to be visible as soon as the code stops writing to it?
And more generally, how is this concept applied on different filesystems?
Thanks
Marco
I was under the impression that a block device is listed under /dev, so for example /dev/xvdf and that file systems live on a partition which is listed with a number behind the block device the partition is on, like /dev/xvdf1 and that all file systems must live on a partition.
I am running CentOS and as part of a course I have to create file systems, partitions and mount file systems. For this course, I have created a file system on device file /dev/xvdf and I have mounted this file system. In addition to that, I have created a partition on /dev/xvdf with the file name of /dev/xvdf1 and created a file system on this partition as well and mounted this file system. This confuses me and I have some questions:
Am I correct that you do not have to create a partition on a block device, but that you can create a file system on a block device directly without a partition?
If so, why would anyone want to do this?
After creating the file system on /dev/xvdf, I created the /dev/xvdf1 partition using fdisk and I allocated the max blocks to this new partition. However, the file system on /dev/xvdf was not removed and still had a file on it. How is this possible if all the blocks on /dev/xvdf have been allocated to the /dev/xvdf1 partition?
Question #1: you are correct. A file system only needs a contiguous space somewhere. You can also create a file system in memory (a virtual disk).
Question #2: the possibility of having a partition table is a good thing, but why use it if you don't need to break a disk (or other block device) into several pieces?
About question #3, I think you overlooked something: probably an error was raised somewhere and you didn't notice, or an error will show up in the future. Even if it appears to work, it cannot; the mounted file system believes it owns all the space reserved for it, and fdisk likewise believes that the blocks it is allocating are free to use. BTW, what is that "/dev/xvdf"? Is it a real device or something else?
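To illustrate the answer to question #1, the following sketch creates a file system directly on an unpartitioned stretch of storage. A plain image file stands in for /dev/xvdf so that nothing real is overwritten; the function name, the image path and the choice of ext2 are assumptions made for the example.

```shell
# make_unpartitioned_fs IMAGE SIZE_MIB
# Create a zero-filled image file and put an ext2 file system on it
# directly, with no partition table in between.
make_unpartitioned_fs() {
    fs_image=$1
    fs_size_mib=$2
    dd if=/dev/zero of="$fs_image" bs=1048576 count="$fs_size_mib" 2>/dev/null || return 1
    # -F skips the prompt mke2fs may show for a target that is not a block device.
    mke2fs -q -F -t ext2 "$fs_image"
}

# Example (the mount step needs root):
# make_unpartitioned_fs /tmp/disk.img 16
# mount -o loop /tmp/disk.img /mnt
```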
I'm working on an embedded system where the rootfs is constructed in a tmpfs partition by the init process. After the rootfs is complete, it will do a pivot-root and start spawning processes located in the rootfs.
But it seems that XIP is not working for our tmpfs, and all the applications are therefore loaded into RAM twice (once in the tmpfs and again when they are loaded for execution).
Can this really be true?
I found an old discussion thread at https://ez.analog.com/thread/45262 which describes the same issue I'm seeing.
How can I achieve XIP for a file-system located in memory?
What you are attempting should indeed be possible, though I haven't tried it myself; the problem is simply that you are not going about it the right way. The block RAM driver ("brd") lets you create a block device that is actually RAM. You do not say which kernel you have, so I will assume 4.14. There you need to enable CONFIG_BLK_DEV_RAM as well as CONFIG_BLK_DEV_RAM_DAX in your kernel configuration; both are under "Device Drivers" -> "Block Devices". Then create such a RAM-backed block device, create a file system on it (for example ext2, ext4 or XFS), build your rootfs in that file system, and pivot-root into it. You are then running from a RAM-backed file system with XIP (now replaced by DAX) functionality, so executing applications should, at least in theory, run directly out of the RAM pages of the block RAM device without creating a second copy of the data.
Please beware that this approach has limitations: for example, kernel modules themselves will still be copied into RAM, and get_user_pages(), O_DIRECT, RDMA, sendfile() and splice() may not work.
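The steps above can be sketched as a shell function. It is untested and rests on several assumptions: a 4.14 kernel with CONFIG_BLK_DEV_RAM and CONFIG_BLK_DEV_RAM_DAX enabled, brd built as a module (if it is built in, drop the modprobe line and size the device with the ramdisk_size= kernel parameter instead), 4 KiB pages, and ext4 with DAX support (CONFIG_FS_DAX). The function name and its arguments are made up for the example.

```shell
# setup_dax_ramdisk SIZE_KIB MOUNTPOINT
# Create one RAM block device, format it as ext4 and mount it with DAX.
setup_dax_ramdisk() {
    rd_size_kib=$1
    rd_mountpoint=$2
    # rd_nr = number of /dev/ramN devices, rd_size = size of each in KiB.
    modprobe brd rd_nr=1 rd_size="$rd_size_kib" || return 1
    # DAX needs the file system block size to equal the page size
    # (assumed to be 4096 bytes here).
    mke2fs -q -t ext4 -b 4096 /dev/ram0 || return 1
    mkdir -p "$rd_mountpoint" || return 1
    mount -t ext4 -o dax /dev/ram0 "$rd_mountpoint"
}

# Example (as root): a 64 MiB RAM disk for the new rootfs.
# setup_dax_ramdisk 65536 /newroot
```

After that, the rootfs would be populated under the mount point before the pivot-root.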
Some relevant things to look at include:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/Kconfig?h=v3.19#n359
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/Kconfig?h=v3.19#n396
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/blockdev/ramdisk.txt?h=v3.19
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/xip.txt?h=v3.19
Note that XIP was replaced by DAX as of the 4.0 kernel, so for newer kernels see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt?h=4.14
Also note that support for DAX was removed from block RAM driver with kernel 4.15 so you will no longer be able to do this once you move to kernel 4.15 and later... See commit 7a862fbbdec665190c5ef298c0c6ec9f3915cf45 for the reasoning behind removing the functionality.
I hope this is enough to set you on the right track and sorry about the bad news that the functionality has been removed since 4.15 kernel...
Does Linux need a writeable file system to function correctly? I'm just running a very simple init programme. Presently I'm not mounting any partitions. The kernel has mounted the root partition as read-only. Is Linux designed to be able to run with just a read-only file system as long as I stick to mallocs, readlines and text to standard out (puts), or does Linux require a writeable file system in order even to perform standard text input and output?
I ask because I seem to be getting kernel panics and complaints about the stack. I'm not trying to run a useful system at the moment. I already have a useful system on another partition. I'm trying to keep it as simple as possible so that I can fully understand things before adding in an extra layer of complexity.
I'm running a fairly standard x86-64 desktop.
No, a writable file system is not required. It is theoretically possible to run GNU/Linux with only a read-only file system.
In practice you probably want to mount /proc, /sys, /dev and possibly /dev/pts for everything to work properly. Note that even some bash commands require a writable /tmp, and some other programs need a writable /var.
You can always mount /tmp and /var as a ramdisk.
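As a rough sketch, an init script could set this up as follows. It assumes a shell is available, that the mount points already exist on the read-only root, and that 64 MiB per tmpfs is enough; the function name and the sizes are illustrative only. Mounting a tmpfs over /var hides whatever the read-only /var contained.

```shell
# mount_runtime_filesystems
# Mount the kernel pseudo file systems plus RAM-backed /tmp and /var.
mount_runtime_filesystems() {
    mount -t proc proc /proc || return 1
    mount -t sysfs sysfs /sys || return 1
    mount -t devtmpfs devtmpfs /dev || return 1
    # tmpfs lives in RAM; anything written here is lost on reboot.
    mount -t tmpfs -o mode=1777,size=64m tmpfs /tmp || return 1
    mount -t tmpfs -o mode=0755,size=64m tmpfs /var
}

# Example (as root, early in init):
# mount_runtime_filesystems || echo "early mounts failed" >&2
```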
Yes and no. No, it doesn't need to be writeable if the system does almost nothing useful.
Yes, since you're running a desktop it needs to be writeable.
Many processes actually need a writeable filesystem as many system calls can create files. e.g. Unix Domain Sockets can create files.
Also many applications write into /var, and /tmp
The way around this is to mount the filesystem read-only and use a filesystem overlay to layer an in-memory filesystem on top of it. That way the paths are writable, but the writes go to RAM and any changes are thrown away on reboot.
See: overlayroot
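Done by hand rather than through overlayroot, the overlay described above might look like the sketch below. It assumes a kernel with overlayfs (file system type "overlay") and root privileges; the function name and the three directories are hypothetical and must already exist.

```shell
# mount_ram_overlay LOWER SCRATCH MERGED
# Overlay a tmpfs (mounted at SCRATCH) on top of the read-only LOWER
# tree and expose the writable result at MERGED.
mount_ram_overlay() {
    ov_lower=$1
    ov_scratch=$2
    ov_merged=$3
    mount -t tmpfs tmpfs "$ov_scratch" || return 1
    # overlayfs needs an upper dir and a work dir on the same file system.
    mkdir -p "$ov_scratch/upper" "$ov_scratch/work" || return 1
    mount -t overlay overlay \
        -o "lowerdir=$ov_lower,upperdir=$ov_scratch/upper,workdir=$ov_scratch/work" \
        "$ov_merged"
}

# Example (as root):
# mount_ram_overlay /mnt/ro /mnt/scratch /mnt/merged
```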
No, it's not required. For example, most distributions have a live version of Linux that boots from a CD or USB disk without actually using any back-end hard disk.
Also, on normal installations the root partition is remounted read-only when corruption is detected on the disk, so the system still comes up with a read-only partition.
You need to capture the vmcore and the stack trace of the panic from the dmesg output to analyse this further.
I have an embedded application that I am working on. To protect the data on this image, its partitions are mounted RO (this helps prevent flash errors when the power is lost unexpectedly; I cannot guarantee clean shutdowns, since you could pull the plug).
An application I am working on that needs to be protected resides on this RO partition, but this program also needs to be able to change configuration files on the same RO file system. I have code that allows me to remount this partition RW as needed (e.g. for firmware updates), but this requires all the processes that are running from the read-only partition to be stopped (i.e. killall my_application). Hence it is not possible for my application to remount the partition it needs to modify without first killing itself (I am not sure which one is the chicken and which one is the egg, but you get the gist).
Is there a way to start my application so that the entire binary is kept in RAM and there is no link back to the partition from which it was run, so that the unmount does not report the partition as busy?
Or alternatively is there a way to safely remount this RO partition without first killing the process running on it?
You can copy it to a tmpfs filesystem and execute it from there. A tmpfs filesystem stores all data in RAM and sometimes in swap space.
Passing the -o remount option to mount (for example -o remount,rw) should also work.
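A minimal sketch of the remount approach, assuming the caller is root and the configuration lives under a mount point such as /mnt/config (a made-up path, as is the function name). Unlike an unmount, a remount does not require the processes running from the partition to exit; going back to read-only can still fail with a busy error while any file on it is open for writing.

```shell
# with_rw MOUNTPOINT COMMAND [ARGS...]
# Remount MOUNTPOINT read-write, run COMMAND, then go back to read-only.
# Returns COMMAND's exit status, or 1 if either remount fails.
with_rw() {
    rw_mountpoint=$1
    shift
    mount -o remount,rw "$rw_mountpoint" || return 1
    "$@"
    rw_status=$?
    sync    # flush the change to flash before dropping back to read-only
    mount -o remount,ro "$rw_mountpoint" || return 1
    return "$rw_status"
}

# Example (as root); both paths are hypothetical:
# with_rw /mnt/config cp /tmp/new.conf /mnt/config/app.conf
```

Keeping the read-write window this short limits the time during which a power cut could corrupt the flash.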
Using Linux, I can access NAND either raw or through files on a filesystem. So when I need to know where my file is really located in NAND, what should I do? I cannot find any utility that provides this feature. Moreover, I cannot see any way of doing this other than hacking the kernel with tons of "printk" calls (which is not a nice way, I guess).
Can anybody enlighten me on this? (I'm using YAFFS2 and JFFS2 filesystems)
You can make a copy of any partition with nanddump. Transfer that partition dump to a PC. The nandsim kernel module can then be used to mount the partitions on the PC.
modprobe nandsim first_id_byte=0x2c second_id_byte=0xda \
third_id_byte=0x90 fourth_id_byte=0x95 parts=2,64,64
flash_erase /dev/mtd3 0 0
ubiformat /dev/mtd3 -f rootfs.ubi
These commands emulate a Micron 256MB NAND flash with four partitions. If you captured just a single partition and not the whole device, don't set parts. You can also run nanddump on each partition and then concatenate the dumps. This example targets mtd3 with a UBIFS partition. For JFFS2 or YAFFS2, you can try nandwrite or some other appropriate flashing utility on the PC.
How to get real file offset in NAND by file name?
The files may span several NAND sectors and they are almost never contiguous. There is not much advantage in keeping file data together, as there is no disk head that takes physical time to seek. Some flash has marginally better efficiency for sequential reads; other flash gives better performance for reads from a different erase block.
I would turn on debug at either the MTD layer or in the filesystem. In a live system, the position of the file may migrate over time on the flash even if it is not written. This is active wear leveling.