I have written a program that recursively reads files in one directory, modifies them and writes them into another directory. Every time I run that program, it croaks after a couple of hundred iterations. I just run it again, and it seems to complete the task.
Either nodejs or Mac OS X or, most likely, nodejs-on-Mac-OS-X, seems to have some kind of limit on the number of files that can be opened at one time. Searching around, I see that a solution is to use something like ulimit -n 10480 and all will be well. Is that the right way? Instinctively, I'd rather not tinker with my system settings and rather modify my program to work within the limits.
An observation: Earlier I used to use Perl to do the task I've described above, and I never had a problem. I am assuming it was because I was opening, transforming, then closing the file, and then moving along. In nodejs, using async mode, I have no way of closing a file before going on to the next file. If I do the task in sync mode, it works fine.

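You can use the async library's *Limit functions to cap the number of files being processed at any one time. Keep the cap below the open-file limit reported by ulimit -n (lower the 1000 used here if your limit is smaller). For example:
var async = require('async');
// At most 1000 calls to processFile are in flight at once.
async.eachLimit(files, 1000, function (file, next) {
    processFile(file, next);
}, done);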
If you wish to process a single file before going to the next one just use eachSeries.
async.eachSeries(files, function (file, next) {
    processFile(file, next);
}, done);

Yes, macOS (and possibly every UNIX variant) has a limit on the number of open files, and yes, Perl didn't have that problem for the reason that you mention.
ulimit is not a system setting in the way you seem to think. It applies to the current process and is inherited by its child processes when you start them, so raising the limit in one process does not affect other processes. The caveat is that if you raise the limit on a globally constrained resource, such as physical memory, you might starve other programs. In other words, if you run ulimit -n 10480 in a shell, the effect lasts only until you exit that shell.
On macOS, the actual system-wide ceiling on open files is given by the command sysctl kern.maxfiles. Regardless of ulimit settings, opening files will fail if you try to open more than that on your entire system at once. On my system, it's 12288. This is the "system setting" where tinkering can have more lasting effects: raising it increases the static amount of memory that the kernel needs (by amounts unknown to me), and lowering it can starve processes of file descriptors.
If your script is relatively short-lived, raising the file descriptor limit using ulimit is probably not a problem.
I don't know about node.js though, and maybe (almost certainly) it has facilities to start just a number of async tasks at a time, so you could also do that.
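To make the per-process nature of the limit concrete, here is a minimal sketch in Python (used purely for illustration, not node.js). The helper name raise_nofile_limit is made up for this example; getrlimit and setrlimit are the standard calls behind ulimit -n. It raises only the soft limit of the current process, never above the hard limit, and nothing outside this process and its future children is affected. On some systems the kernel may still reject a large value, which this sketch does not handle.

```python
import resource

def raise_nofile_limit(target):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper: never lowers the limit and never exceeds the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do
    # Clamp the request to the hard limit unless the hard limit is unlimited.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_nofile_limit(10480) at the start of a script has roughly the effect of running ulimit -n 10480 in the shell that launches it, limited to that one process.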


How to read a [nonblocking] filedescriptor of a file that is appended to (aka, like tail -f)?

Actually, I am using libev; but under the hood this is using epoll (I'm only on Linux). When I add a watcher to read a file and all data has been read, I still get a callback saying that there is data to read, but read(2) returns 0 (EOF). At that point I have to stop the watcher or else it will continue to tell me that there is something to read. However, if I stop the watcher and then some other process appends data to that file, I'll never see it.
What is the correct way to get notified that additional/appended data can be read from a file after I have already read to the end?
I'd prefer the answer in terms of libev, but lower level will do too (I can then probably translate that to how to do that with libev).
It is very common, for some reason, for people to think that making an fd nonblocking, or calling poll/select/..., behaves differently for files than for other types of file descriptions. In fact, nonblocking behaviour and I/O readiness behaviour are essentially the same for all types of file descriptions: the kernel will return immediately from read/write etc. if the outcome is known, and will signal I/O readiness when this is the case. When a socket has an EOF condition, select will signal that the socket is ready to read, and you will get 0 (for EOF). The same happens for files: if you are at the end of a file, the kernel will return immediately from read and return 0 to signal EOF.
The important difference is that files can change contents at random places, and can be extended. Pipes and sockets are not random access and cannot be appended to once closed. Thus, while the behaviour is consistent, it is often not what is wanted, namely waiting for a file to change in some way.
The conflict in many people's minds is simply that they want to be told "when there is new data", but if you think about it a bit, you will realise that simply waking you up would not be an adequate interface for this, as you have no way of knowing why you woke up, and what changed.
POSIX doesn't have an interface to do that, other than regularly polling the fd or file (and in case of random changes, regularly reading the whole file!). Some operating systems have an interface to do something similar (kqueue on BSDs, inotify on GNU/Linux), but they are usually not a perfect match either (for example, inotify cannot watch an fd for changes; it watches a path for changes).
The closest you can get with libev is to use an ev_stat watcher. It behaves as if you would stat() a path regularly, and invoke the watcher callback whenever the stat data changes. Portably, it does just that: it regularly calls stat, but on some operating systems (currently only inotify on GNU/Linux, as kqueue doesn't have correct semantics for this) it can use other mechanisms to speed this up in some cases, although it will fall back to regular stat polling everywhere, for example for when the file is on a network file system, where inotify can't see remote changes.
To answer your question: If you have a path, you can use an ev_stat watcher to watch for stat data changes, such as size/mtime etc. changes. Doing this right can be a bit tricky (see the libev documentation, especially the part about stat time resolution), and you have to keep in mind that this watches a path, not a file descriptor, so you might want to compare the device/inode of your file descriptor and the watched path regularly to see if you still have the correct file open.
This still doesn't tell you what part of the file has changed.
Alternatively, since you apparently only want to read appended data, you could opt to just read() the file regularly (in an ev_timer callback) and do away with all the complexity and hassles of an ev_stat watcher setup, while not forgetting to also compare the path stat data with your fd stat data to see if you still have the right file open, depending on whether the file you are reading might get renamed or replaced. Sometimes programs also truncate files, something you can also detect by seeing the size decrease between stat calls.
This is essentially what older tail -f implementations do, while newer ones might, for example, take hints (only) from inotify, just like ev_stat watchers do.
None of that is easy, and details depend on your knowledge of how exactly the file changes, but it's the best you can do.
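The polling approach described above (read regularly, compare the path's device/inode with the open descriptor, treat a shrinking size as truncation) can be sketched as follows. This is Python rather than libev/C, purely to show the logic; the class name FileFollower and its methods are made up for this example. In libev you would call the equivalent of poll() from an ev_timer callback. The sketch ignores the small race between stat() and open() when a file is replaced.

```python
import os

class FileFollower:
    """Poll a path for appended data, tail -f style (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.fd = os.open(path, os.O_RDONLY)
        self.offset = 0

    def _drain(self):
        # A size smaller than our offset means the file was truncated.
        if os.fstat(self.fd).st_size < self.offset:
            self.offset = 0
        chunks = []
        while True:
            chunk = os.pread(self.fd, 65536, self.offset)
            if not chunk:  # 0 bytes: we are at EOF for now
                break
            chunks.append(chunk)
            self.offset += len(chunk)
        return b"".join(chunks)

    def _replaced(self):
        # Does the path still name the file we have open?
        try:
            on_disk = os.stat(self.path)
        except FileNotFoundError:
            return False  # path is gone; keep following the old file
        ours = os.fstat(self.fd)
        return (on_disk.st_dev, on_disk.st_ino) != (ours.st_dev, ours.st_ino)

    def poll(self):
        """Return any bytes that appeared since the previous call."""
        data = self._drain()  # finish the old file first
        if self._replaced():
            os.close(self.fd)
            self.fd = os.open(self.path, os.O_RDONLY)
            self.offset = 0
            data += self._drain()
        return data
```

An empty result from poll() simply means nothing new has arrived yet; the caller decides how often to ask again.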

Linux ~/.bashrc export most recent directory

I have several environment variables in my ~/.bashrc that point to different directories. I am running a program that creates a new folder every time that it runs and puts a time stamp in the directory name. For example, baseline_2015_11_10_15_40_31-model-stride_1-type_1. Is there a way of making a variable that can link to the last created directory?
Your mileage may vary a lot depending on what exactly you need to accomplish. However, in almost all cases I would advise against doing something as weird and unreliable as what's described below, and would instead revise your architecture to avoid hunting for directories.
Method 1
If your program creates a subdirectory inside current directory, and you always know that nothing else happens in that directory and you want a subdirectory with latest creation timestamp, then you can do something like:
TARGET_DIR=$(ls -t1 --group-directories-first | head -n1)
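For comparison, the same selection can be sketched in Python, which makes the rule explicit: consider only directories and pick the one with the newest modification time (ls -t also sorts by modification time, not creation time). The function name newest_subdir is made up for this example, and the same caveat applies: it is only reliable if nothing else touches that directory.

```python
import os

def newest_subdir(parent="."):
    """Return the name of the most recently modified subdirectory, or None."""
    newest_name, newest_mtime = None, None
    with os.scandir(parent) as entries:
        for entry in entries:
            # Skip plain files and symlinks; only real directories count.
            if not entry.is_dir(follow_symlinks=False):
                continue
            mtime = entry.stat(follow_symlinks=False).st_mtime_ns
            if newest_mtime is None or mtime > newest_mtime:
                newest_name, newest_mtime = entry.name, mtime
    return newest_name
```

Unlike the ls pipeline, this returns nothing rather than a file name when the directory contains no subdirectories.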
Method 2
If a lot of stuff happens on the system, then you'll end up monitoring what your program does with the filesystem and reacting when it creates a directory. There are two ways to do that, using strace and inotify, both are relatively complex. Here's the way to do that with strace:
strace -o some_temp_file.strace your_complex_program_that_creates_dir
TARGET_DIR=$(sed -ne '/^mkdir(/ { s/^mkdir("\(.*\)", .*).*$/\1/; p }' some_temp_file.strace)
This snippet runs your_complex_program_that_creates_dir under control of strace, which essentially logs every system call your program makes into a file. Afterwards, this file is analyzed to seek a line like
mkdir("target_dir", 0777) = 0
and extract value of "target_dir" into a variable. Note that:
if your program creates more than 1 directory (even for temporary purposes and deletes them afterwards, or whatever) — there's really no way to determine which of them to grab
running a program with strace is much slower than normal due to the huge overhead of logging all the syscalls.
it's super non-portable — facilities like strace exist on most modern OS, but implementations will vary a lot
A solution with inotify works in a similar way but through a different mechanism: it watches the parent directory for filesystem events and reacts when a new directory is created there (remembering the created directory).
However, I repeat, I'd strongly suggest against using any of these solutions beyond research interest.

Retrieving a list of all file descriptors (files) that a process ever opened in linux

I would like to be able to get a list of all of the file descriptors (now considering this question to pertain to actual files) that a process ever opened during the runtime of the process. The problem with polling /proc/(PID)/fd/ is that you only get a snapshot in time of what is currently open. Is there a way to force linux to keep this information around long enough to log it for the entire run of the process?
First, notice that a file descriptor which is open-ed then close-d by the application is recycled by the kernel (a future open could give the same file descriptor). See open(2) and close(2) and read Advanced Linux Programming.
Then, consider using strace(1); you'll be able to log all the syscalls (or perhaps just open, socket, close, accept, ... that is the syscalls changing the file descriptor table). Of course strace is using the ptrace(2) syscall (which you probably don't want to bother using directly).
The simplest way would be to run strace -o /tmp/ yourprog arguments... and to look, e.g. with some pager like less, into the quite big /tmp/ file.
As Gearoid Murphy commented you could restrict the output of strace using e.g. -e trace=file.
BTW, to debug Makefile-s this is the wrong approach. Learn more about remake.
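Once strace has written its log, pulling out the paths that were successfully opened can be sketched in Python as below. The function name opened_paths and the regular expression are assumptions of this example: they cover the common open(...) and openat(...) line shapes and deliberately ignore other syscalls (socket, accept, creat, ...) as well as lines that strace splits into "unfinished"/"resumed" halves when following several processes.

```python
import re

# Matches e.g.  openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
OPEN_CALL = re.compile(
    r'\b(?:openat|open)\('      # syscall name
    r'(?:[^,"]+,\s*)?'          # optional dirfd argument of openat
    r'"((?:[^"\\]|\\.)*)"'      # the quoted path
    r'.*\)\s*=\s*(\d+)'         # a non-negative return value, i.e. success
)

def opened_paths(lines):
    """Return each successfully opened path once, in first-seen order."""
    seen = []
    for line in lines:
        match = OPEN_CALL.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Failed calls (return value -1) are skipped on purpose, since they never produced a file descriptor.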

Buffering on top of VFS

The problem I am trying to deal with is saving a large number (millions) of small files (up to 50KB) that are sent over the network. The saving is done sequentially: the server receives a file or a dir (via the network) and saves it to disk; the next one arrives, it is saved, and so on.
Apparently, the performance is not acceptable if multiple server processes coexist (let's say I have 5 processes which all read from the network and write at the same time), because the I/O scheduler doesn't manage to merge the writes efficiently.
A suggested solution is to implement some sort of buffering: each server process should have a 50MB cache, in which it should write the current file, do a chdir etc; when the buffer is full, it should be synced to disk, therefore obtaining an I/O burst.
My questions to you:
1) I know that a buffering mechanism already exists (the disk buffer); do you think that the above scenario is going to add some improvement? (the design is much more complicated and it's not easy to implement a simple test case)
2) do you have any suggestions, where to look if I would implement this?
Many thanks.
You're going to need to do better than "apparently the performance is not acceptable".
How are you measuring it? Do you have an exact, reproducible figure?
What is your target?
In order to do optimisation, you need two things: a method of measuring it (a metric) and a target (so you know when to stop, or how useful or useless a particular technique is).
Without either, you're sunk, I'm afraid.
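As one possible starting point for such a metric, here is a minimal Python sketch that writes a batch of small files and reports files per second. Everything in it is an assumption for illustration: the function name, the file naming scheme, and the default sizes (50KB matches the upper bound mentioned in the question). A real measurement would run on the target filesystem, with several writer processes, and would be repeated to check reproducibility.

```python
import os
import time

def measure_small_file_writes(directory, count=1000, size=50 * 1024, fsync=False):
    """Write `count` files of `size` bytes into `directory`; return files/second."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"file_{i:07d}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            if fsync:
                # Force the data to disk so the page cache does not hide the cost.
                f.flush()
                os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return count / max(elapsed, 1e-9)  # guard against a zero-length interval
```

Comparing the figure with and without fsync gives a rough idea of how much of the cost is hidden by the page cache.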
How important are those writes? I have three suggestions (which can be combined), but one of them is a lot of work, and one of them is less safe...
I'm guessing you're seeing some poor performance due in part to the journaling common to most modern Linux filesystems. The journaling causes barriers to be inserted into the IO queue when file metadata is written. You can try turning down the safety (and maybe turning up the speed) with mount(8) options barrier=0 and data=writeback.
But if there is a crash, the journal might not be able to prevent a lengthy fsck(8). And there's a chance the fsck(8) will wind up throwing away your data when fixing the problem. On the one hand, it's not a step to take lightly, on the other hand, back in the old days, we ran our ext2 filesystems in async mode without a journal both ways in the snow and we liked it.
IO Scheduler elevator
Another possibility is to swap the IO elevator; see Documentation/block/switching-sched.txt in the Linux kernel source tree. The short version is that deadline, noop, as, and cfq are available. cfq is the kernel default, and probably what your system is using. You can check:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
The most important parts from the file:
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the deadline or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [deadline] cfq
Changing the scheduler might be worthwhile, but depending upon the barriers inserted into the queue by the journaling requirements, there might not be much reordering possible. Still, it is less likely to lose your data, so it might be the first step.
Application changes
Another possibility is to drastically change your application to bundle files itself, and write fewer, larger, files to disk. I know it sounds strange, but (a) the id Software development team packaged their maps, textures, objects, etc., into giant zip files that they would read into the program with a few system calls, unpack, and run with, because they found the performance much better than reading a few hundred or few thousand smaller files. Load times between levels were drastically shorter. (b) The Gnome desktop team and KDE desktop teams took different approaches to loading their icons and resource files: the KDE team packaged their many small files into larger packages of some sort, and the Gnome team did not. The Gnome team had longer startup delays and were hoping the kernel could make some efforts to improve their startup time. The kernel team kept suggesting the fewer, larger, files approach.
Creating or renaming a file, syncing it, having lots of files in a directory, and having lots of files (with tail waste) are some of the slow operations in your scenario. However, the only way to avoid them is to write fewer files (for example by writing out archives, concatenated files, or similar). I would actually try a (limited) parallel async or sync approach. The IO scheduler and caches are typically quite good.
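The "fewer, larger files" idea can be sketched with a plain tar archive: many small payloads received in memory are written as members of one file instead of one file each. This is a Python sketch under stated assumptions; bundle_files and read_bundle are made-up helper names, and an uncompressed tar is just one possible container.

```python
import io
import tarfile

def bundle_files(archive_path, items):
    """Write (name, bytes) pairs into a single uncompressed tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for name, data in items:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # addfile reads exactly this many bytes
            tar.addfile(info, io.BytesIO(data))

def read_bundle(archive_path):
    """Return {member name: contents} for every regular file in the archive."""
    with tarfile.open(archive_path, "r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}
```

The trade-off is that individual files are no longer directly visible in the filesystem; anything that needs one has to go through the archive.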

how does kernel handle new file creation

I wish to understand the way the kernel works when a user/app tries to create a file in a directory.
The background - We have a Java application which consumes messages over JMS, processes them, and then writes the XML to an outbound queue and a local directory. Yesterday we observed unusual delays in writing to the directory. Running 'ls|wc -l' showed more than 300,000 files in there. I did a quick strace on the process and found it full of mutex calls (more than 3/4 of the calls in the strace were mutex).
So I thought that new file creation was taking time because the system has to check certain things every time (e.g. the names of existing files, to make sure that the new file with a specific name can be created) amongst 300,000 files and then create the file.
I cleared the directory and the application resumed normal service levels.
My questions
Was my analysis correct (It seems cuz the app started working fine after a clear down)?
More important, how does the kernel work when you try to create a new file in a directory?
Can the abnormal number of mutex calls be attributed to the high number of files in the directory?
Many thanks
Please read about the Linux Filesystem, i-nodes and d-nodes.
The file system is organized into fixed-sized blocks. If your directory is relatively small, it fits in the direct blocks and things are fast. If your directory is not too big, it fits in the direct blocks and some indirect blocks, and is still reasonably fast. If your directory becomes too big, it spills into double indirect blocks and becomes slow.
Actual sizes depend on file system and kernel configuration.
Rule of thumb is to keep the directory under 12 blocks, depending on your block size. Many systems use 8K blocks; a fast directory is under 98,304 bytes.
A file entry is something like 16*4 bytes in size (IIRC), so plan on no more than 1500 files per directory as a practical upper limit.
Directories with large numbers of entries are often slow - how slow depends on the underlying filesystem.
The common solution is to create a hierarchy of directories, so each dir only has a few hundred entries.
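A hierarchy like that can be sketched by deriving the subdirectory from a hash of the file name, so files spread evenly and the location can be recomputed from the name alone. This is a Python sketch (the application in the question is Java, so treat it as pseudocode for the idea); sharded_path and write_sharded are made-up names, and two levels of two hex characters is an arbitrary choice.

```python
import hashlib
import os

def sharded_path(root, filename, levels=2, width=2):
    """Map a file name to root/xx/yy/filename using a hash of the name."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)

def write_sharded(root, filename, data):
    """Write `data` under the sharded location, creating directories as needed."""
    path = sharded_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With these defaults there are 256 * 256 = 65,536 leaf directories, so 300,000 files average fewer than five per directory.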
Mutex system calls are a result of the application (probably something in the JVM or the Java libraries) making mutex calls.
Synchronisation internal to the kernel you will not see via strace, as this only examines system calls themselves.
A directory with lots of files should not become inefficient if you are using a filesystem which uses directory indexes; most now do (ext3 does optionally but it's normally enabled nowadays).
Non-indexed directories (like those used on the bad old filesystems - ext2, vfat etc) get really bad with lots of files, and you'll see the "open" system call taking a lot longer.
