Shake: Signal whether anything had to be rebuilt at all - haskell

I use Shake to build a bunch of static webpages, which I then have to upload to a remote host using sftp. Currently, the cron job runs:
git pull # get possibly updated sources
./my-shake-system
lftp ... # upload
I’d like to avoid running the final command if shake did not actually rebuild anything. Is there a way to tell shake “Run command foo, after everything else, and only if you changed something!”?
Or alternatively, have shake report whether it did something in the process exit code?
I guess I can add a rule that depends on all possibly generated files, but that seems redundant and error-prone.

Currently there is no direct or simple way to determine whether anything was built. It's also not as useful a concept as in simpler build systems, since certain rules (especially those that define storedValue to return Nothing) will always "rerun", but then very quickly decide that the rules depending on them don't need to run. To Shake, that is the same as rerunning. I can think of a few approaches; which one is best probably depends on your situation:
Tag the interesting rules
You could tag each interesting rule (one that produces something that needs uploading) with a function that writes to a specific marker file. If that file exists after the build, you need to upload. The file works slightly better across multiple Shake runs: if something changes in the first run but nothing in the second, the marker file is still present. If that doesn't matter in your setup, use an IORef instead of a file.
Use profiling
Shake has quite advanced profiling. If you pass shakeProfile=["output.json"] it will produce a JSON file detailing what was built and when. Runs are indexed by an Int, with 0 for the most recent run, and any runs that built nothing are excluded. If you have one rule that always fires (e.g. it writes to a dummy file with alwaysRerun), then if anything else fired in the same run, something was rebuilt.
Watch the .shake.database file size
Shake has a database, stored under shakeFiles. On each uninteresting run it grows by a fairly small amount (~100 bytes); the exact increment varies by system but is fixed for a given one. If it grows by a greater amount, the run did something interesting.
Of these approaches, tagging the interesting rules is probably the simplest and most direct (although it does run the risk of you forgetting to tag something).

Related

Tracing Node.js program execution

I'd like to trace the execution path of an arbitrary Node.js program.
Specifically, I'd like to run a program (server or script), and have some sort of block-level (function call, loop, if statement) trace of the execution.
Constraints
Output must contain files / lines / hit count for all lines during execution
No code minification. Istanbul is great but I want to keep the code that is executed in the end as readable as possible.
For long-running processes (servers, for example), I want to be able to see "current" line coverage (or as up-to-date as possible)
I don't want to lose any coverage data, so while Profiling would give me some hints as to lines hit, it's not really code coverage.
Things I don't care about
Exactly how the coverage is read. For example, it could be output to a file, it could be read via the code, etc.
Coverage format
Things I've investigated so far:
Using NODE_V8_COVERAGE:
I found that if I set the NODE_V8_COVERAGE environment variable to a directory, coverage data will be output to that directory when the program exits (here's a blog post on the creation of this feature).
The problem that I'm facing here is that I'm not sure there's a way to trigger the generation of these reports before the program terminates.
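For reference, here is a minimal sketch of how that mechanism behaves: run the target script in a child process with NODE_V8_COVERAGE pointing at a directory, then read whatever JSON files appear there once the child exits. The script name app.js is a placeholder, and the exact file names and JSON layout can vary between Node.js versions.

const { execFileSync } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Run the target script (placeholder name) with V8 coverage output enabled.
const covDir = fs.mkdtempSync(path.join(os.tmpdir(), 'v8cov-'));
execFileSync(process.execPath, ['app.js'], {
  env: { ...process.env, NODE_V8_COVERAGE: covDir },
});

// When the child exits, Node writes one or more coverage-*.json files here.
for (const file of fs.readdirSync(covDir)) {
  const report = JSON.parse(fs.readFileSync(path.join(covDir, file), 'utf8'));
  // Each report has a "result" array of per-script entries with ranges and hit counts.
  console.log(file, report.result.length, 'scripts covered');
}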
Using inspector
I have also been experimenting with Node.js inspector. I found a useful CPU profiler here. This could end up being helpful, but this profiler works by sampling, not as a hook into the language. As a result, I only get line numbers / counts for parts of the code that were slow.
I also tried using Profiler.startPreciseCoverage, thinking that this might give me every line that was executed (I didn't find the documentation very clear on what it actually does). It didn't seem to be any more useful.
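For what it's worth, precise coverage can also be collected from inside the running process via an inspector session, without waiting for process exit. The sketch below is one possible way to wire it up (the 30-second interval is an arbitrary placeholder): Profiler.startPreciseCoverage with callCount: true and detailed: true asks V8 to track block-level hit counts rather than sampled stacks, and Profiler.takePreciseCoverage can be called at any time.

const inspector = require('inspector');

const session = new inspector.Session();
session.connect();

// Ask V8 to track block-level hit counts for everything that runs from now on.
session.post('Profiler.enable', () => {
  session.post('Profiler.startPreciseCoverage', { callCount: true, detailed: true });
});

// Call whenever "current" coverage is wanted, without stopping the process.
function dumpCoverage() {
  session.post('Profiler.takePreciseCoverage', (err, params) => {
    if (err) throw err;
    // params.result: one entry per script, with functions and byte-offset ranges + counts.
    for (const script of params.result) {
      console.log(script.url, script.functions.length, 'functions');
    }
  });
}

setInterval(dumpCoverage, 30000); // placeholder interval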
Using Istanbul
I would like to avoid instrumenting code if possible.
Question
It seems like my options are limited, but at the same time this is only a result of my Googling for an hour or two.
Is there a better way to capture line coverage with the constraints listed above?
There's a pull request pending for Node.js to add functionality to programmatically start/stop/write V8 coverage information. If you are adventurous, you could use git to get the version of Node.js you want to use, apply the commits from the patch, and compile a Node.js binary.
If you clone the Node.js repository, the various versions of Node.js are tagged. So you can get the code for Node.js 12.19.0 by checking out the v12.19.0 tag.
You can cherry-pick the commits from the pull request normally, or you could use curl -L https://github.com/nodejs/node/pull/33807.patch | git am to apply the commits as patches.
Instructions for compiling/building the Node.js binary can be found at https://github.com/nodejs/node/blob/master/BUILDING.md#building-nodejs-on-supported-platforms.
Longer term, you could chime in on the pull request about whether it meets your needs and hopefully get it moving again; it seems to have stalled.

How do I monitor changes of files and only look at them when the changes are finished?

I'm currently monitoring files in Node.js using fs.watch. The problem I have is that if, for example, I copy a 1 GB file into a folder I'm watching, I get a notification about the file the moment the copy starts. If I start reading it immediately, I end up with bad data since the file has not finished copying. For example, a zip file has its table of contents at the end, but I'd end up reading it before its table of contents has been written.
Off the top of my head I could set up some task to call fs.stat on the file every N seconds and only try to read it when the stats stop changing. That would work, but it seems less than ideal: I'd like my app to be as responsive as possible, and calling stat on a bunch of files every second seems heavy, while calling stat only every 5 or 10 seconds seems unresponsive.
Is there some more robust way to get notified when a file has finished being modified?
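For illustration, here is a rough sketch of the stat-polling idea described above; the watched directory, quiet period, and poll interval are arbitrary placeholders, and a real version would also deduplicate watch events per path. The answer below points to a library that packages essentially this logic.

const fs = require('fs');
const path = require('path');

// After a watch event, keep calling fs.stat until size and mtime have not
// changed for `quietMs`, then assume the writer has finished.
function whenStable(filePath, onReady, quietMs = 2000, pollMs = 500) {
  let seen = null;
  const timer = setInterval(() => {
    fs.stat(filePath, (err, st) => {
      if (err) return; // the file may have been removed or not be visible yet
      const key = `${st.size}:${st.mtimeMs}`;
      if (seen && seen.key === key) {
        if (Date.now() - seen.since >= quietMs) {
          clearInterval(timer);
          onReady(filePath);
        }
      } else {
        seen = { key, since: Date.now() };
      }
    });
  }, pollMs);
}

fs.watch('/incoming', (eventType, name) => {
  if (name) whenStable(path.join('/incoming', name), (p) => console.log('ready:', p));
});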
So I did a project last year which required doing "file watching". There is a better library out there than fs.watch. Check out npm chokidar.
https://www.npmjs.com/package/chokidar
Underneath it uses fs.watch, but wraps a number of improvements around it.
There is a property called awaitWriteFinish. Really it's doing some polling on the file to determine whether or not the file is finished writing. I used it and it really works great.
Setting this property will allow you to work against that file, always ensuring that the file has been completely written. And you don't need to go off and implement your own method of determining if the file is complete. Should save a bunch of time.
Aside from that, I don't believe you can really get away from polling when it comes to determining whether a file has finished being written. Chokidar is still polling; it's just that you don't have to write the logic yourself, and you can configure the polling interval if CPU utilization turns out to be too high.
Edit: I'd also like to add: just give it a shot and see how it works. I get that you want it as responsive as possible, but having something working is better than having something not working at all. It might be that even with a polling solution this is not an issue for you. If it does turn out to be a performance problem, address it at that point and look for a "better" solution.
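For concreteness, here is a minimal chokidar sketch using awaitWriteFinish; the watched path and the thresholds are placeholders.

const chokidar = require('chokidar');

// 'add'/'change' fire only after the file size has stayed constant for
// stabilityThreshold ms; chokidar re-checks it every pollInterval ms.
const watcher = chokidar.watch('/incoming', {
  awaitWriteFinish: {
    stabilityThreshold: 2000,
    pollInterval: 100,
  },
});

watcher.on('add', (filePath) => {
  console.log('finished writing:', filePath); // safe to read the whole file now
});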

"find" command cannot detect files added during execution

Stack Overflow has saved my life on countless occasions over the years. Now it's time for me to post my first question ever, the answer to which I have been unable to find so far.
I have a tool (language/implementation is irrelevant) which accepts a text file as input. This text file (let's call it file_list.txt) contains a long list of file paths, one per line. The tool then iterates over the lines in file_list.txt and does something with every file path. This needs to be done continuously and file_list.txt needs to always contain the latest file paths because users continuously upload or delete files from the share being monitored. To achieve this, I have set up a cron job which calls a script. First the script calls the find utility with the search parameters required and pipes the output to a temporary file. When the file is fully populated, it is moved to file_list.txt. Then, once this is done, the tool is invoked with file_list.txt as an input parameter.
So far, so good. The share being monitored is VERY LARGE (~60 TB) and the find command takes around 5 hours to execute. This is not a problem since we have multiple overlapping find commands running in parallel (triggered once per hour). The entire setup runs on a compute farm, so CPU utilization, etc. is also not an issue.
The problem arises in the lag time for file detection. Ideally, I want a user to add a file and I want one of the already running, overlapping find commands to detect this file within a matter of minutes. However, I have noticed that none of the already-running find commands will detect this file. Only a find command started AFTER this file was added will detect it. This means that generally, I need to wait around 5 hours for a newly added file to be detected. This leads me to believe that the find utility somehow acts on a "cached" version of the share state when it was triggered. Is this true? Can anyone confirm this? And if so, what can I do to improve the detection lag?
Please let me know if further clarification is required. I am happy to provide any further details.
To summarize: you have a gigantic filesystem volume (60 TB) which contains a huge number of files, and you use find(1) to name a large number of those files and put those names into a text file for analysis. You have discovered that files are not listed if they are created after find(1) was started but before it finished.
I think the best solution is to stop thinking of this as a batch job, and do it "online" using inotify(7). You can use the inotify API to be immediately informed of changes to your filesystem, including new files being created. There is of course the original C API, as well as the excellent pyinotify.
With inotify, you can start a watcher program once and leave it running continuously (under a supervisor if needed for restarts). The operating system can then notify you whenever a relevant filesystem event occurs, and you can respond immediately rather than waiting for the next scan.
The one downside for your use case might be that the watcher program does need to run on a machine which has the filesystem mounted locally. But the overall compute resources required are probably much less than your current approach of repeated linear scans.
Executing find commands and piping the output to temporary files might work up to a certain scale, but it is far from optimal. If you want a less resource-intensive, more reactive solution, I would recommend considering reimplementing your software using the inotify interface:
The inotify API provides a mechanism for monitoring filesystem events.
Inotify can be used to monitor individual files, or to monitor
directories. When a directory is monitored, inotify will return
events for the directory itself, and for files inside the directory.
So an event will be raised for each file change or newly added file.
Note that you can then keep an internal list of files up to date, which only needs to be changed when you get an event.

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Unfortunately, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts, that start and renice the binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process based on its executable path and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?
If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.
Sometimes polling is a necessity, and sometimes it is even the better choice in the end, believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, it is far preferable to the added complexity, cost, and risk of developing your own kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls for it.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world though and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and have a hybrid solution in place. Kernel-mode hooks are available for certain process-related events in Windows (creation and destruction), but they are not exposed in user mode, and they aren't much help for monitoring other process metrics anyway. I don't think any OS is going to natively inform you of every change to every process metric, and the overhead of that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at interval) vs. notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job of handling CPU-bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O-bound threads, so even under high load a script should be responsive, I guess.
There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exceptions, although you might regard the ability to get a minimal shell outside of your control as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system then you might see smart users preparing static-linked versions of apps to avoid linker-imposed priorities.
Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the entries whose exe link resolves to a path with slashes; the others are kernel processes.
Check whether you have already processed that one; otherwise, look up the new priority in your configuration file and use renice(8) to tweak its priority.
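As an illustration only (not the answer author's code), here is a small Node.js sketch of that loop. The path-to-nice-value map is a made-up stand-in for the configuration file, and os.setPriority is used in place of shelling out to renice.

const fs = require('fs');
const os = require('os');

// Hypothetical config: executable path -> nice value.
const priorities = { '/usr/bin/ffmpeg': 15, '/usr/bin/make': 10 };
const seen = new Set(); // PIDs already adjusted (note that PIDs can be recycled)

function sweep() {
  for (const entry of fs.readdirSync('/proc')) {
    if (!/^\d+$/.test(entry) || seen.has(entry)) continue; // numeric dirs are processes
    let exe;
    try {
      exe = fs.readlinkSync(`/proc/${entry}/exe`); // fails for kernel threads and other users' processes
    } catch {
      continue;
    }
    if (exe in priorities) {
      try {
        os.setPriority(Number(entry), priorities[exe]); // same effect as renice
      } catch {} // the process may already have exited
      seen.add(entry);
    }
  }
}

setInterval(sweep, 5000); // still polling, as discussed above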
If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.
Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.
If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger an IN_OPEN and an IN_ACCESS event.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe are a symlink to the executable file in question and renice the process id in question.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
use warnings;
use strict;
my $x = shift(@ARGV);            # path of the executable to watch
my $w = new Linux::Inotify2;
$w->watch($x, IN_ACCESS, sub
{
    # find the PID(s) whose /proc/<pid>/exe points at the watched file
    for (glob("/proc/*/exe"))
    {
        if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
        {
            system(@ARGV, $1)    # run the remaining arguments with the PID appended
        }
    }
});
1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables (or doing other things).

How to write reliable file management code on NFS

Please give me some general advice on how to write reliable file management code using NFS. How do I avoid or handle ESTALE errors? The programming language doesn't really matter.
Thanks.
Writing robust software is best done at the highest level possible.
So rather than handling a specific type of error in a specific place in your code, ensure that if the whole operation fails in some way, it can be rolled back / ignored safely and then will automatically re-run at a later time and do the work it missed because of the error.
For example, if you are writing out some files, you could write them into a temporary directory and rename the directory once the files have been written successfully; moreover, if on a subsequent run you discover the temporary directory is still there, remove it (provided you're sure no other process in the infrastructure is still using it).
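As a small sketch of that pattern (the function name, target path, and file contents are all placeholders, and it assumes the target directory does not exist yet): stage everything into a temporary directory on the same mount, then make it visible with a single rename.

const fs = require('fs/promises');
const path = require('path');

// Stage into a temp directory on the same (NFS) mount, then rename it into place.
// Readers either see nothing or the complete directory, never a half-written one.
async function publish(targetDir, files) {
  const staging = await fs.mkdtemp(path.join(path.dirname(targetDir), '.staging-'));
  try {
    for (const [name, data] of Object.entries(files)) {
      await fs.writeFile(path.join(staging, name), data);
    }
    await fs.rename(staging, targetDir);
  } catch (err) {
    await fs.rm(staging, { recursive: true, force: true }); // roll back, as described above
    throw err;
  }
}

// publish('/mnt/nfs/data/batch-42', { 'result.json': JSON.stringify({ ok: true }) });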

Resources