I have a multithreaded application that is opening and reading the same file (not writing). I am opening a different file descriptor for each thread (but they all point to the same file). Each thread then reads the file and may close it and open it again if EOF is reached. Is this OK? If I perform fclose() on one file descriptor, does it affect the other file descriptors that point to the same file?
For Linux systems you don't need multiple file descriptors to do this. You can share a single file descriptor and use pread() to perform the seek and read atomically, without modifying the file descriptor's offset at all.
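As an illustration of that approach, here is a minimal sketch (the file name, chunk size, and thread count are hypothetical): every thread passes its own offset to pread(), so the shared descriptor's offset is never moved:

    /* Minimal sketch (not the poster's code): several threads reading one shared
     * file descriptor with pread(), so no thread disturbs the shared file offset. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4
    #define CHUNK    4096

    static int shared_fd;               /* one descriptor shared by all threads */

    static void *reader(void *arg)
    {
        long idx = (long)arg;
        char buf[CHUNK];
        off_t off = (off_t)idx * CHUNK; /* each thread starts at its own offset */

        /* pread() reads at an explicit offset and never moves the shared offset */
        ssize_t n = pread(shared_fd, buf, sizeof buf, off);
        if (n >= 0)
            printf("thread %ld read %zd bytes at offset %lld\n",
                   idx, n, (long long)off);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        shared_fd = open("data.bin", O_RDONLY);   /* hypothetical file name */
        if (shared_fd < 0) { perror("open"); return 1; }

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, reader, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        close(shared_fd);
        return 0;
    }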
That's OK. You can open the same file as many times as you want, and each file descriptor will be independent of the others.
That should work fine, provided each thread has its own file handle. Since you mention use of fclose(), that suggests you are also using fopen() in each thread and each thread only affects its own FILE * variable.
Is there a problem?
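As a minimal sketch of that fopen()-per-thread pattern (the file name and thread count are made up), each thread works with its own FILE * and its fclose() only affects that handle:

    /* Minimal sketch: every thread fopen()s the same file independently and
     * only ever fclose()s its own FILE * handle. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        (void)arg;
        FILE *fp = fopen("data.bin", "rb");    /* private handle for this thread */
        if (!fp)
            return NULL;

        char buf[4096];
        while (fread(buf, 1, sizeof buf, fp) > 0)
            ;                                  /* consume the file */

        fclose(fp);                            /* closes only this thread's handle */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }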
I understand that threads share almost everything in the PCB (except the PC and the stack), including the file descriptor table. A file descriptor table entry is a pointer into the system-wide open file table, where each entry has an operation type, a file offset, and file data. If a process opens a file and creates multiple threads reading from the same file (same file descriptor) using the read system call, why will each thread read a different part of the file? (Given that they access the same file descriptor in the same table, and thus the same file and the same offset?)
Answer from Kaylum:
If the threads are using the same file descriptor, then any read by one thread moves the file offset for all threads. Hence when another thread does a read, it will not get the same data the first thread read, but will instead continue from where the first thread stopped reading.
So they do access the same entry in the system-wide file table, but since each thread changes the file offset for all the threads, they do not read the same part of the file.
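To make the shared offset concrete, here is a minimal sketch (with a made-up file name): two consecutive read() calls on the same descriptor return consecutive parts of the file, because both advance the single offset stored in the open file description:

    /* Minimal sketch illustrating the shared offset behind one descriptor. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char a[8], b[8];
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        read(fd, a, sizeof a);   /* bytes 0..7  */
        read(fd, b, sizeof b);   /* bytes 8..15 - continues where the first read stopped */

        printf("offset is now %lld\n", (long long)lseek(fd, 0, SEEK_CUR));
        close(fd);
        return 0;
    }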
In Linux, when I call open() on a file, is the complete data loaded into RAM?
I am asking this because the file I need to open is huge (around 700 MB). If such a large file is loaded in its entirety, it might create problems for other applications and make the system slower.
If the complete data is not loaded (because of demand paging or something similar), then what can we assume to be the initial size that is loaded?
In case it matters, here are my file details:
File type: Binary
File size: varies between 700 MB and 2 GB
No, it doesn't. It just gives you a file descriptor, which you can then use for read/write and some other operations. Think of a file descriptor as an abstraction, or a handle, over what lies on disk.
You should also read the manual page for open. It states:
The open() function shall establish the connection between a file and a file descriptor. It shall create an open file description that refers to a file and a file descriptor that refers to that open file description.
In Linux, when I call open() on a file, is the complete data loaded into RAM?
No.
what can we assume to be the initial size that is loaded?
Zero.
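As a rough illustration of that (the path and buffer size are made up): opening even a very large file costs essentially nothing, and data only enters your process a buffer at a time as you read() it:

    /* Minimal sketch: open() a large file and read it in small chunks.
     * Only the buffer below lives in this process's memory; the kernel pages
     * file data through the page cache as needed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/path/to/huge.bin", O_RDONLY);  /* cheap: no data read yet */
        if (fd < 0) { perror("open"); return 1; }

        char buf[64 * 1024];
        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, sizeof buf)) > 0)    /* data arrives on demand */
            total += n;

        printf("read %lld bytes using a %zu-byte buffer\n", total, sizeof buf);
        close(fd);
        return 0;
    }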
As a Linux device driver developer I was under the impression that the file object is a structure local to each process, and that its address is stored in the fd table entry for the corresponding fd. But then I came across section 5.6 of The Linux Programming Interface by Michael Kerrisk, which states:
Two different file descriptors that refer to the same open file description share a file offset value. Therefore, if the file offset is changed via one file descriptor (as a consequence of calls to read(), write(), or lseek()), this change is visible through the other file descriptor. This applies both when the two file descriptors belong to the same process and when they belong to different processes.
I am befuddled... Could someone kindly help me improve my understanding?
Each process does have its own file descriptor table, and each open() of a file yields a separate open file description. So there is sanity there!
The exception is when a file descriptor is duplicated, either within a process (via dup()) or across processes (by one process fork()ing a copy with all the same FDs, or by passing a file descriptor through a UNIX domain socket). When this happens, the two descriptors end up sharing some properties with each other, including the offset.
This is not necessarily a bad thing. It means, for instance, that two processes that are both writing to a shared file descriptor will not end up overwriting each other's output. It can sometimes have unexpected results, though. But it's not usually something that you'd end up with without knowing about it.
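A minimal sketch contrasting the two cases from the quoted passage (the file name is made up): descriptors obtained via dup() share one open file description (and offset), while descriptors from separate open() calls do not:

    /* Minimal sketch: dup()ed descriptors share an offset; independently
     * open()ed descriptors each have their own. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[10];

        int fd1 = open("data.bin", O_RDONLY);
        int fd2 = dup(fd1);                    /* same open file description as fd1 */
        int fd3 = open("data.bin", O_RDONLY);  /* a brand-new open file description */
        if (fd1 < 0 || fd2 < 0 || fd3 < 0) { perror("open/dup"); return 1; }

        read(fd1, buf, sizeof buf);            /* advances the offset shared by fd1/fd2 */

        printf("fd2 offset: %lld (moved by fd1's read)\n",
               (long long)lseek(fd2, 0, SEEK_CUR));
        printf("fd3 offset: %lld (independent description, still at 0)\n",
               (long long)lseek(fd3, 0, SEEK_CUR));

        close(fd1); close(fd2); close(fd3);
        return 0;
    }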
Say I want to synchronize the data buffers of a file system to disk (in my case that of a USB stick partition) on a Linux box.
While searching for a function to do that, I found the following:
DESCRIPTION
sync() causes all buffered modifications to file metadata and data to be written to the underlying file systems.
syncfs(int fd) is like sync(), but synchronizes just the file system containing the file referred to by the open file descriptor fd.
But what if the file system has no file on it that I can open and pass to syncfs? Can I "abuse" the dot file? Does it appear on all file systems?
Is there another function that does what I want? Perhaps by providing a device file with major / minor numbers or some such?
Yes, I think you can do that. The root directory of your file system will have at least one inode, the one for the root directory itself, so you can open the . entry and use that descriptor. Also play around with ls -i to see the inode numbers.
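As a minimal sketch of that idea, assuming the stick is mounted at a hypothetical /mnt/usb: open the mount point (its root directory) read-only and hand the descriptor to syncfs():

    /* Minimal sketch: flush just the file system mounted at /mnt/usb. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/usb", O_RDONLY);   /* a directory descriptor works here */
        if (fd < 0) { perror("open"); return 1; }

        if (syncfs(fd) < 0) {                  /* sync only this file system */
            perror("syncfs");
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }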
Is there a possibility of avoiding your problem by mounting your file system with the sync option? Or do performance issues get in the way? Did you have a look at remounting? That can sync your file system as well in particular cases.
I do not know what your application is, but I have had problems synchronizing files to a USB stick with the FAT32 file system; it resulted in weird read and write errors. I cannot imagine any other valid reason why you should sync an empty file system.
From the man 8 sync description:
"sync writes any data buffered in memory out to disk. This can include (but is not
limited to) modified superblocks, modified inodes, and delayed reads and writes. This
must be implemented by the kernel; The sync program does nothing but exercise the sync(2)
system call."
So, note that it's all about modifications (modified inodes, superblocks, etc.). If you don't have any modifications, there is nothing to sync up.
In Windows, if I open a file with MS Word and then try to delete it, the system will stop me; it prevents the file from being deleted.
Is there a similar mechanism in Linux?
How can I implement it when writing my own program?
There is not a similar mechanism in Linux. I, in fact, find that feature of Windows to be an incredible misfeature and a big problem.
It is not typical for a program to hold a file open that it is working on anyway, unless the program is a database that updates the file as it works. Programs usually just open the file, write the contents, and close it when you save your document.
vim's .swp file is updated as vim works, and vim holds it open the whole time, so even if you delete it, the file doesn't really go away. vim will just lose its recovery ability if you delete the .swp file while it's running.
In Linux, if you delete a file while a process has it open, the system keeps it in existence until all references to it are gone. The name in the filesystem that refers to the file will be gone. But the file itself is still there on disk.
If the system crashes while the file is still open it will be cleaned up and removed from the disk when the system comes back up.
The reason this is such a problem in Windows is that mandatory locking frequently prevents operations that should succeed from succeeding. For example, a backup process should be able to read a file that is being written to. It shouldn't have to stop the process that is doing the writing before the backup proceeds. In many other cases, operations that should be able to move forward are blocked for silly reasons.
The semantics of most Unix filesystems (such as Linux's ext2 family) are that a file can be unlink(2)'d at any time, even if it is open. However, after such a call, any process that still has the file open can continue to read and write it through its open file descriptor. The filesystem does not actually free the storage until all open file descriptors have been closed. These are very long-standing semantics.
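A minimal sketch of those semantics (the path is made up): the name disappears as soon as unlink() returns, but the data remains readable through the still-open descriptor, and the storage is only released when the last descriptor is closed:

    /* Minimal sketch: unlink an open file and keep using it via the descriptor. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/tmp/demo-unlink.txt";
        char buf[32] = {0};

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        write(fd, "still here\n", 11);
        unlink(path);                       /* the name is gone from the directory */

        lseek(fd, 0, SEEK_SET);
        read(fd, buf, sizeof buf - 1);      /* ...but the data is still readable */
        printf("read back: %s", buf);

        close(fd);                          /* now the kernel frees the storage */
        return 0;
    }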
You may wish to read more about file locking in Unix and Linux (e.g., the Wikipedia article on File Locking). Basically, mandatory and advisory locks exist on Linux, but they're not guaranteed to prevent what you want to prevent.
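For example, a minimal sketch of advisory locking with flock() (the path is made up): cooperating processes that take the lock serialize their access, but nothing stops a process that never calls flock() from opening, reading, or even unlinking the file:

    /* Minimal sketch: advisory locking with flock(). */
    #include <sys/file.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/demo.lock", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (flock(fd, LOCK_EX) < 0) {       /* blocks until the exclusive lock is ours */
            perror("flock");
            return 1;
        }

        printf("lock held; doing work...\n");
        sleep(1);                           /* stand-in for the critical section */

        flock(fd, LOCK_UN);                 /* release (also released on close/exit) */
        close(fd);
        return 0;
    }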