NFS time gap too long - linux

I've got 2 machines that exchange data via NFS: 2 different files of about 20 bytes each. The client writes its file, and the server reads and deletes it; then the server writes its own, different file, and the client reads and deletes it. And so on. The 2 files always have the same names.
It all worked fine; both machines run Linux 2.4. I have now added another client, which runs Linux 2.6. It works in the same way, only with files that have different names.
The problem is that the new client sees the file from the server about 40 seconds after it is written. I can wait 4-5 or even 10 seconds, but not 40.
I've tried mounting the remote partition with -o vers=2 or -o vers=3, but with no effect.
Then I tried echo 3 > /proc/sys/vm/drop_caches (see NFS cache-cleaning command?), also with no effect.
What can I do to reduce the time gap?

You can try a listen-notify approach, using inotify to monitor filesystem events. From the man page:
"The inotify API provides a mechanism for monitoring file system events. Inotify can be used to monitor individual files, or to monitor directories. When a directory is monitored, inotify will return events for the directory itself, and for files inside the directory."
And from the FAQ page:
"Q: Can I watch sysfs (procfs, nfs...)? Simply spoken: yes, but with some limitations. These limitations vary between kernel versions and tend to get smaller. Please read information about particular filesystems."
It's very likely this will decrease the time gap.

Related

What is the performance impact of the touch command or its equivalent system call?

In a custom-developed NodeJS web server (running on Linux) that can dynamically generate thumbnail images, I want to cache these thumbnails on the filesystem and keep track of when they are actually used. If they haven't been used for a certain period of time (say, one year), I'd consider them "orphans" and delete them.
To this end, I considered touching them each time a client requests them, so that I can use the modification time to check when they were last used.
I assume this would incur a significant performance hit on the web server in high-load situations, as it is an "unnecessary" filesystem write, while, apart from logging, most requests will only consist of reads.
Has anyone performed any benchmarks on how big an impact this might have and if it's worthwhile?
Touching on every request is probably not great, and it's worth avoiding a timestamp update every time you open a file. That's the reason the relatime / noatime mount options were invented: to prevent the existing Unix access-time timestamp from being updated every time a file was opened.
Is your filesystem mounted with relatime? That updates atime at most once per day, when the file is opened (even for reading). The other mount option that's common on Linux is noatime: never update atime.
If you can't let the kernel do this for you without needing extra system calls, you might be better off making an fstat system call after opening the file and only touching it to update the mod time if the mod time is older than a day or week. (You're concerned about intervals of a year, so a week is fine.) i.e. manually implement the relatime logic, but for mod time.
Frequently accessed files will not need updates (and you're still making a total of one system call for them, plus a date-compare). Rarely accessed files will need another system call and a metadata write. If most of the accesses in your access pattern are to a smallish set of files repeatedly, this should be excellent.
Possible reasons for not being able to use atime could include:
The filesystem is mounted with noatime and it's not worth changing that
The files sometimes get read by something other than your web server / CGI setup. (e.g. a backup job that does more than compare size / timestamps)
Of course, the other option is to not update timestamps on use, and simply let a thumbnail be regenerated once a year after your weekly cron job deleted it. That might be ok depending on your workload.
If you manually touch some of the "hottest" thumbnails so you stagger their deletion, instead of having a big load spike this time next year, you could be ok. And/or have your deleter walk your filesystem very slowly, again so you don't have a big batch of frequently-needed thumbnails deleted at once.
You could come up with schemes like enabling mod-time updates in the week before the bi-annual cleanup, so thumbnails that should stay hot in cache get their modtimes updated. But probably better to just fstat / check / update all the time since that shouldn't be too much extra load.
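The fstat / check / update idea can be sketched in a few lines. This is only an illustration in Python rather than NodeJS; the function name touch_if_stale and the one-week threshold are assumptions, not anything from the question. It uses os.stat on the path, and os.fstat on an already-open descriptor would work the same way.

```python
import os
import time

ONE_WEEK = 7 * 24 * 3600  # assumed refresh threshold, in seconds


def touch_if_stale(path, max_age=ONE_WEEK, now=None):
    """Bump mtime only when it is older than max_age.

    Returns True if the timestamp was rewritten, False otherwise.
    This mimics the relatime logic, but for the modification time.
    """
    now = time.time() if now is None else now
    st = os.stat(path)
    if now - st.st_mtime < max_age:
        # Recently touched: one stat call, no metadata write.
        return False
    # Keep atime as it was and set mtime to "now".
    os.utime(path, (st.st_atime, now))
    return True
```

Hot thumbnails then cost one stat call per request, and only files that have not been touched for a week trigger a metadata write.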

Mirroring files from one partition to another on the second disk without RAID1

I am looking for a program that would allow me to mirror one partition to another disk (something like RAID1) for Linux. It doesn't have to be a windowed application, it can be a console application, I just want what is in one place to be mirrored to another.
It would be nice if it were possible to mirror a specific folder that I would care for instead of copying everything from the given partition.
I was looking on the internet, but it's hard to find something that would give such opportunities, hence the idea to ask such a question.
I do not want to make fake RAID on Linux or hardware RAID because I read that if the motherboard fails then it is best to have the same second one to recover data.
I will be grateful for every suggestion :)
You can check my script "CopyDirFile", written in bash, which is available on GitHub.
You can perform a replication (mirroring) task of any source folder to another destination folder (deleting a file in the source folder means deleting it in the destination folder).
The script also allows you to create copy tasks (deleted files in the source folder will not be deleted in the target folder).
The tasks are executed in background at a specified time, not all the time, frequency is set by the user when creating the task.
You can also set the task to start automatically when the user logs on.
All the necessary information can be found in the README file in the repository.
If I understood you correctly, I think it meets your requirements.
Linux has standard support for software RAID: mdraid.
It allows you to bundle two disk devices into a RAID 1 device (among other things); you then create a filesystem on top of that device.
LVM offers another way to do software RAID; it doesn't seem to be very popular, but it's certainly supported.
(If your system supports hardware RAID, on the motherboard or with a separate RAID controller, Linux can use that, too, but that doesn't seem to be what you're asking here.)

Distributed Processing of Volumetric Image Data

For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans).
The detection program is a command line tool. If I run it on my local computer on a single file and single-threaded it takes about 10 seconds. Processing results are written to a text file.
A typical run would be:
10000 images with 300 MB each = 3TB
10 seconds on a single core = 100000 seconds = about 28 hours
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1TByte disks, CentOS Linux 7.2).
Theoretically the 480 cores should only need 3.5 minutes for the task.
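For reference, the arithmetic behind these estimates can be written out as below. All inputs are the figures quoted above; the per-server data volume is a derived number that assumes the files are spread evenly over the 20 servers.

```python
# Figures taken from the question; nothing here is measured.
images = 10_000
size_mb = 300
seconds_per_image = 10
servers = 20
cores_per_server = 24

total_tb = images * size_mb / 1_000_000           # total data volume
per_server_gb = images * size_mb / servers / 1_000  # even split (assumption)
cores = servers * cores_per_server
serial_hours = images * seconds_per_image / 3600
ideal_minutes = images * seconds_per_image / cores / 60

print(f"{total_tb:.1f} TB total, {per_server_gb:.0f} GB per server")
print(f"{serial_hours:.1f} h on one core, {ideal_minutes:.1f} min on {cores} cores")
```

An even split comes to about 150 GB per server, which fits on the 1TByte local disks. The 3.5 minutes is an ideal figure that ignores I/O and startup costs.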
I am considering using Hadoop, but it's not designed for processing binary data, and it splits input files, which is not an option.
I probably need some kind of distributed file system. I tested using NFS, and the network becomes a serious bottleneck. Each server should only process its locally stored files.
The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain whether we need data locality, i.e. each node holds part of the data on a local HD and processes only its local data.
I regularly run large scale distributed calculations on AWS using Spot Instances. You should definitely use the cluster of 20 servers at your disposal.
You don't mention which OS your servers are using but if it's linux based, your best friend is bash. You're also lucky that it's a command line programme. This means you can use ssh to run commands directly on the servers from one master node.
The typical sequence of processing would be:
Run a script on the Master Node which sends and runs scripts via ssh on all the Slave Nodes.
Each Slave Node downloads a section of the files from the Master Node, where they are stored (via NFS or scp).
Each Slave Node processes its files, saving the required data via scp, mysql or text scrape.
To get started, you'll need to have ssh access to all the Slaves from the Master. You can then scp files to each Slave, such as the script. If you're running on a private network, you don't have to be too concerned about security, so just set ssh passwords to something simple.
In terms of CPU cores, if the command line program you're using isn't designed for multi-core, you can just run several ssh commands to each Slave. The best thing to do is run a few tests and see what the optimal number of processes is, given that too many processes might be slow due to insufficient memory, disk access or similar. But say you find that 12 simultaneous processes give the fastest average time; then run 12 scripts via ssh simultaneously.
It's not a small job to get it all done, however, you will forever be able to process in a fraction of the time.
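A rough sketch of the master-side logic, in Python rather than bash, might look like the following. The host names, the tool name "detect" and the helper names are all hypothetical. It also assumes that the files are already on each node, that file names contain no whitespace, and that GNU xargs is available there to run several processes at once.

```python
import shlex
import subprocess


def partition(files, nodes):
    """Assign files to nodes round-robin; returns {node: [files]}."""
    plan = {node: [] for node in nodes}
    for i, name in enumerate(files):
        plan[nodes[i % len(nodes)]].append(name)
    return plan


def build_ssh_argv(node, tool="detect", jobs=12):
    """argv for ssh: the remote xargs runs `tool` once per file name
    read from stdin, with up to `jobs` processes in parallel."""
    remote = f"xargs -P {jobs} -n 1 {shlex.quote(tool)}"
    return ["ssh", node, remote]


def run_plan(plan, tool="detect", jobs=12):
    """Start one ssh session per node, feed it its file list, wait for all."""
    procs = []
    for node, files in plan.items():
        p = subprocess.Popen(build_ssh_argv(node, tool, jobs),
                             stdin=subprocess.PIPE, text=True)
        p.stdin.write("\n".join(files) + "\n")
        p.stdin.close()
        procs.append(p)
    return [p.wait() for p in procs]
```

Calling run_plan(partition(all_files, ["node01", "node02"])) would start the work on two nodes. The 12 parallel jobs are just the example figure from above and should be replaced by whatever your own tests show to be fastest.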
You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and split chunks into lines, but you can write your own implementations of FileInputFormat and RecordReader. I've created a custom FileInputFormat for another purpose (I had the opposite problem: splitting input data more finely than the default), but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 plus https://www.timofejew.com/hadoop-streaming-whole-files/
On the other hand, Hadoop is a heavy beast. It has significant overhead for mapper start-up, so the optimal running time for a mapper is a few minutes. Your tasks are too short. Maybe it is possible to create a cleverer FileInputFormat which interprets a bunch of files as a single file and feeds the files as records to the same mapper; I'm not sure.

Multiple Machines -- Process Many Files Concurrently?

I need to concurrently process a large number of files (thousands of different files, with an average size of 2MB per file).
All the information is stored on one (1.5TB) network hard drive, and will be accessed (read) by about 30 different machines. For efficiency, each machine will be reading (and processing) different files (there are thousands of files that need to be processed).
Every machine -- following its reading of a file from the 'incoming' folder on the 1.5TB hard drive -- will be processing the information and be ready to output the processed information back to the 'processed' folder on the 1.5TB drive. The processed information for every file is of roughly the same average size as the input files (about 2MB per file).
Are there any dos and don'ts when one is building such an operation? Is it a problem to have 30 machines or so read (or write) information to the same network drive at the same time?
(note: existing files will only be read, not appended/written; new files will be created from scratch, so there are no issues of multiple access to the same file...).
Are there any bottlenecks that I should expect?
(I am using Linux, Ubuntu 10.04 LTS, on all machines, if it matters at all)
Things you should think about:
If the processing to be done for each file is simple, then your real bottleneck isn't the number of files you read in parallel, but the capabilities of the hard disk drive.
Unless processing takes a long time (say, some seconds per file) you'll go past a point in which adding more processes will only slow down matters to a crawl, since every process is reading and writing results, and the disk can only do so much.
Try to minimize disk access: for example, download files and produce results locally while other processes are downloading, and send the results back when the load on the disk goes down.
The more I write the more it boils down to how much processing needs to be done for each file. If it's simple parsing, something that takes milliseconds, 1 machine or 30 will make little difference.
You need to be careful that two worker processes don't pick up (and try to do) the same piece of work at the same time.
Unfortunately, NFS filesystems don't have semantics that allow you to easily do that.
So what I'd recommend is to use something like Gearman and a producer/consumer model, where one process gives out work to whoever is available to do it.
Another possibility is to have a database (e.g. mysql) with a table of all tasks, and have the processes atomically "claim" tasks for themselves.
But all of this is only worthwhile if your processes are mostly CPU-bound. If you're trying to get more IO bandwidth (or operations) out of your NAS by using multiple clients, it's not going to work.
I am assuming that you will be running at least gigabit ethernet here (or it's probably not worth it).
Have you tried running multiple processes on the same machine?

how does kernel handle new file creation

I wish to understand the way the kernel works when a user/app tries to create a file in a directory.
The background: we have a Java application which consumes messages over JMS, processes them, and then writes the XML to an outbound queue and a local directory. Yesterday we observed unusual delays in writing to the directory. Running 'ls|wc -l' showed more than 300,000 files in there. I did a quick strace on the process and found it full of mutex calls (more than 3/4 of the calls in the strace were mutex calls).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. the names of the existing files, to make sure that a new file with a specific name can be created) among 300,000 files before creating the file.
I cleared the directory and the application returned to normal service levels.
My questions
Was my analysis correct? (It seems so, because the app started working fine after the clear-down.)
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
J
Please read about the Linux Filesystem, i-nodes and d-nodes.
http://en.wikipedia.org/wiki/Inode_pointer_structure
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
Rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than 1500 files per directory as a practical upper limit.
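Taking the numbers in this answer at face value (8K blocks, 12 direct blocks, roughly 64 bytes per entry, none of which is verified here for any particular filesystem), the rule of thumb works out as:

```python
block_size = 8 * 1024   # 8K blocks, as assumed in the answer
direct_blocks = 12      # direct block pointers per inode
entry_size = 16 * 4     # rough per-entry size quoted above

fast_dir_bytes = direct_blocks * block_size
max_entries = fast_dir_bytes // entry_size

print(fast_dir_bytes, max_entries)
```

That gives 98,304 bytes and 1,536 entries, which is where the "no more than 1500 files" figure comes from.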
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
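One common way to build such a hierarchy is to derive the subdirectory names from a hash of the file name. A minimal sketch, in Python rather than Java; the function name and the two-level, two-hex-character layout are just illustrative choices:

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Return root/<h0>/<h1>/filename, where h0, h1 are slices of a
    hex digest of the file name. The same name always maps to the
    same directory, so files can be found again without an index."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)


def write_sharded(root, filename, data):
    """Create the shard directories if needed and write the file."""
    path = sharded_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)
    return path
```

With two levels of two hex characters there are 65,536 leaf directories, so 300,000 files average fewer than five per directory. With levels=1 it would be 256 directories of roughly 1,200 files each.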
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
Synchronisation internal to the kernel you will not see via strace, as this only examines system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.
