Does open() load a file into RAM? - Linux

In Linux, when I call open() on a file, is the complete data loaded into RAM?
I am asking because the file I need to open is huge (around 700 MB). If such a large file is loaded in full, it could cause problems for other applications and slow the system down.
If the complete data is not loaded (because of demand paging or similar mechanisms), what can we assume about the initial amount that is loaded?
In case it matters, here are the file details.
File type: Binary
File size: varies between 700 MB and 2 GB

No, it doesn't. It just gives you a handle on the file, called a file descriptor, which you can then use for read/write and some other operations. Think of a file descriptor as an abstraction, or a handle, over what lies on disk.
You should also read the manual page for open. It states:
The open() function shall establish the connection between a file and a file descriptor. It shall create an open file description that refers to a file and a file descriptor that refers to that open file description.
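To make this concrete, here is a minimal sketch (the file name bigfile.bin is just a placeholder): open() on even a multi-gigabyte file returns immediately, and data only enters memory (the page cache) when you actually read it.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.bin", O_RDONLY);   /* returns immediately; no data is loaded */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));   /* only now are pages brought into RAM */
    if (n == -1)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}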

In Linux, when I call open() on a file, is the complete data loaded into RAM?
No.
What can we assume about the initial amount that is loaded?
Zero.

Related

If threading shares the same file descriptor table, how does each thread read a different part of the file?

I understand that threads share almost everything in the PCB (except the PC and the stack), including the file descriptor table. Each file descriptor table entry points to an entry in the system-wide open file table, where each entry has an operation type, a file offset, and the file data. If a process opens a file and creates multiple threads that read from the same file (file descriptor) using the read system call, why does each thread read a different part of the file? (Given that they access the same file descriptor in the same table, and thus the same file and the same offset?)
Answer from Kaylum:
If the threads are using the same file descriptor, then any read by one thread moves the file offset for all threads. Hence, when another thread does a read, it will not get the same data as the first thread, but rather continue from where the first thread stopped reading.
So the threads do access the same entry in the system-wide open file table, but since each read changes the file offset for all of them, they do not read the same part of the file.
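A quick way to observe this shared-offset behavior (a sketch; data.txt is a placeholder name): two consecutive read() calls on the same descriptor return consecutive parts of the file, and the same applies when the reads happen in different threads sharing the descriptor.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.txt", O_RDONLY);  /* placeholder file name */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char a[4], b[4];
    read(fd, a, sizeof(a));  /* reads bytes 0..3; the shared offset moves to 4 */
    read(fd, b, sizeof(b));  /* reads bytes 4..7, continuing where the first read stopped */

    printf("first:  %.4s\nsecond: %.4s\n", a, b);
    close(fd);
    return 0;
}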

How safe is it reading / copying a file which is being appended to?

If a log file has events constantly being appended to it, how safe is it to read that file (or copy it) with another process?
Unix allows concurrent reading and writing. It is perfectly safe to read a file while others are appending to it.
Of course, it can happen that an append is still unfinished while a reader reaches the end of the file; that reader will then get an incomplete version (e.g. only part of a new log entry at the end of the file). But technically this is correct, because the file really was in that state while it was being read (e.g. copied).
EDIT
There's more to it.
If a writer process has an open file handle, the file will stay on disk for as long as that process keeps the handle open.
If you remove the file (rm(1), unlink(2)), it is only removed from its directory. It stays on disk, and the writer (and everybody else who happens to have an open file handle) can still read the contents of the already removed file. Only after the last process closes its file handle are the file's contents freed on disk.
This is sometimes an issue when a process writes a large log file that is filling up the disk. If it keeps an open file handle to the log file, the system administrator cannot free this disk capacity using rm.
A typical approach then is to kill the process as well. Hence it is a good idea, as a process, to close the log file's handle again after writing to the log (or at least to close and reopen it from time to time).
There's more:
If a process has an open file handle on a log file, this handle contains a position. If the log file is now emptied (truncate(1), truncate(2), open(2) for writing without an append flag, : > filepath), the file's contents are indeed removed from the disk. If the process holding the open file handle then writes to this file, it writes at its old position, e.g. at an offset of several megabytes. Doing this to an empty file fills the gap with zeros.
This is no real problem if a sparse file can be created (typically possible on Unix file systems); otherwise it will quickly fill the disk again. In any case it can be very confusing.
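Here is a small sketch of that scenario (app.log and the 5 MB offset are placeholders): a writer whose log file is truncated behind its back leaves a hole of zeros at the start of the file on its next write.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Simulate a writer whose log file gets truncated behind its back. */
    int fd = open("app.log", O_WRONLY | O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Pretend we have already written 5 MB; the handle's offset is now 5 MB. */
    lseek(fd, 5 * 1024 * 1024, SEEK_SET);

    /* Someone empties the file, e.g. ": > app.log" or truncate(1). */
    truncate("app.log", 0);

    /* The next write still happens at the old offset, leaving a 5 MB
       gap of zeros at the start of the file. */
    write(fd, "new entry\n", 10);

    struct stat st;
    fstat(fd, &st);
    printf("apparent size: %lld, blocks actually allocated: %lld\n",
           (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}

Note that opening the log with O_APPEND avoids the gap, since every write then goes to the current end of the file; that matches the "not using append flags" condition above.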

Syncing a file system that has no file on it

Say I want to synchronize the data buffers of a file system to disk (in my case that of a USB stick partition) on a Linux box.
While searching for a function to do that, I found the following:
DESCRIPTION
sync() causes all buffered modifications to file metadata and data to be written to the underlying file systems.
syncfs(int fd) is like sync(), but synchronizes just the file system containing the file referred to by the open file descriptor fd.
But what if the file system has no file on it that I can open and pass to syncfs? Can I "abuse" the dot file? Does it appear on all file systems?
Is there another function that does what I want? Perhaps one taking a device file with major/minor numbers or some such?
Yes, I think you can do that. Every file system has at least the inode of its root directory, so you can open the . entry and pass the resulting descriptor to syncfs. Also play around with ls -i to see the inode numbers.
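A minimal sketch of this approach (/mnt/usb is a placeholder mount point): open the file system's root directory itself and pass that descriptor to syncfs, so no regular file is needed.

#define _GNU_SOURCE   /* syncfs() is a Linux-specific extension */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Open the root directory of the mounted file system. */
    int fd = open("/mnt/usb", O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Flush all buffered data and metadata of this one file system. */
    if (syncfs(fd) == -1)
        perror("syncfs");

    close(fd);
    return 0;
}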
Could you avoid the problem by mounting your file system with the sync option, or would the performance hit be too severe? Have you looked at remounting? In particular cases that can sync your file system as well.
I do not know what your application is, but I have had problems synchronizing files to a USB stick with the FAT32 file system; they resulted in weird read and write errors. I cannot imagine any other valid reason why you would sync an empty file system.
From the man 8 sync description:
"sync writes any data buffered in memory out to disk. This can include (but is not
limited to) modified superblocks, modified inodes, and delayed reads and writes. This
must be implemented by the kernel; The sync program does nothing but exercise the sync(2)
system call."
So note that it is all about modifications (modified inodes, superblocks, etc.). If you don't have any modifications, there is nothing to sync.

In Linux, how to create a file descriptor for a memory region

I have a program handling some data either in a file or in a memory buffer. I want to provide a uniform way to handle both cases.
I could either 1) mmap the file, so both can be handled uniformly as a memory buffer; or 2) create a FILE* using fopen and fmemopen, so both can be accessed uniformly as a FILE*.
However, I can't use either of the ways above. I need to handle both as file descriptors, because one of the libraries I use only takes a file descriptor, and it calls mmap on that file descriptor.
So my question is: given a memory buffer (we can assume it is aligned to 4K), can we get a file descriptor backed by this memory buffer? I saw popen given as an answer in some other question, but I don't think the fd from popen can be mmap-ed.
You cannot easily create a file descriptor from "some memory region" (other than a C standard library stream, which is not helpful). However, you can create a shared memory region and get a file descriptor in return.
From shm_overview(7):
shm_open(3)
Create and open a new object, or open an existing object. This is analogous to open(2). The call returns a file descriptor for use by the other interfaces listed below.
Among the listed interfaces is mmap, which means that you can memory-map the shared memory just as you would memory-map a regular file.
Thus, using mmap for both situations (file or memory buffer) should work seamlessly, provided you control the creation of that "memory buffer".
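A minimal sketch of that approach (the object name /mybuf and the 4 KB size are placeholders): create a shared memory object, size it with ftruncate, and the resulting descriptor can be mmap-ed like a regular file, including by the library.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create (or open) a named shared memory object. */
    int fd = shm_open("/mybuf", O_RDWR | O_CREAT, 0600);
    if (fd == -1) {
        perror("shm_open");
        return 1;
    }

    /* A fresh object has size 0; give it a size before mapping. */
    if (ftruncate(fd, 4096) == -1) {
        perror("ftruncate");
        return 1;
    }

    /* fd behaves like a file descriptor: it can be mmap-ed. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(p, "hello");          /* fill the "memory buffer" */

    /* ... pass fd to the library that mmaps it ... */

    munmap(p, 4096);
    close(fd);
    shm_unlink("/mybuf");        /* remove the object when done */
    return 0;
}

On older glibc versions you may need to link with -lrt for shm_open/shm_unlink.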
You could write (perhaps using mmap) your data to a tmpfs-based file (perhaps under the /run/ directory), then pass the opened file descriptor to your library.

Multiple file descriptors to the same file, C

I have a multithreaded application that opens and reads the same file (no writing). I open a different file descriptor in each thread (but they all point to the same file). Each thread then reads the file and may close it and open it again if EOF is reached. Is this OK? If I call fclose() on one file handle, does it affect the other file descriptors that point to the same file?
On Linux you don't need multiple file descriptors to do this. You can share a single file descriptor and use pread to do the seek/read operation atomically, without modifying the file descriptor's offset at all.
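For illustration, a sketch of pread (data.bin and the offsets are placeholders): each call passes an explicit offset and leaves the descriptor's own offset untouched, so concurrent threads cannot disturb each other.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    char chunk[4096];
    /* e.g. thread i would do: pread(fd, chunk, sizeof(chunk), i * sizeof(chunk)); */
    ssize_t n = pread(fd, chunk, sizeof(chunk), 8192);  /* read 4 KB at offset 8192 */
    if (n == -1)
        perror("pread");
    else
        printf("read %zd bytes at offset 8192\n", n);

    close(fd);
    return 0;
}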
That's OK. You can open the same file as many times as you want, and each file descriptor will be independent of the others.
That should work fine, provided each thread has its own file handle. Since you mention fclose(), that suggests you are also using fopen() in each thread, so each thread only affects its own FILE * variable.
Is there a problem?
