How can I run two bash scripts simultaneously and without repetition of the same action? - linux

I'm trying to write a script that automatically runs a data analysis program. The data analysis takes a file, analyzes it, and puts all the outputs into a folder. The program can be run on two terminals simultaneously (each analyzing a different subject file).
I wrote a script that can do all the inputs automatically. However, I can only get it to run one analysis at a time. If I run the script in two terminals simultaneously, it analyzes the same subject twice (useless).
Currently, my script looks like:
for name in `ls [file_directory]`
do
[Data analysis commands]
done
If you run this on two terminals, it will start from the top of the directory containing all the data files. This is a problem, so I tried to do checks for duplicates but they weren't very effective.
I tried a name comparison with the if command (didn't work because all the output files except one had a different name, so it would check the first output folder at the top of the directory and say the name was different even though an output folder further down had the same name). It looked something like:
for name in `ls <file_directory>`
do
for output in `ls <output directory>`
do
If [ name==output ]
then
echo "This file has already been analyzed."
else
<Data analysis commands>
fi
done
done
I thought this was the right method but apparently not. I would need to check all the names before any decision is made (rather than one by one, which is what this does).
Then I tried moving completed data files with the mv command (didn't work because "name" in the for statement stored all the file names so it went down the list regardless of what was in the folder at present). I remember reading something about how shell scripts do not do things in "real time" so it makes sense that this didn't work.
My thought was looking for some sort of modification to that if statement so it does all the name checks before I make a decision (how?)
Also, are there any other commands I might be missing that I could try?

One pattern I often use is the split command.
ls <file_directory> > file_list
split -d -l 10 file_list file_list_part
This will create files like file_list_part00 to file_list_partnn.
You can then feed these file names to your script.
for file_part in file_list_part*
do
    # Read one file name per line from the current part file.
    while IFS= read -r file_name
    do
        data_analysis_command "$file_name"
    done < "$file_part"
done
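To actually get two analyses running at once from the split lists, each part has to be handled by its own process. Here is a minimal sketch, assuming the part files from above already exist; run_parts is a hypothetical helper name and data_analysis_command is only a stand-in for the real program.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real analysis program.
data_analysis_command() { echo "analysed $1"; }

# run_parts PREFIX
# Start one background worker per part file (PREFIX00, PREFIX01, ...),
# then wait until every worker has finished.
run_parts() {
    local prefix=$1 part file_name
    for part in "$prefix"*; do
        [ -f "$part" ] || continue
        (
            # Each worker reads its own part file, one name per line.
            while IFS= read -r file_name; do
                data_analysis_command "$file_name"
            done < "$part"
        ) &
    done
    wait
}
```

Calling run_parts file_list_part starts as many workers as there are part files, so for exactly two workers the list would have to be split into two parts.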

Never use "ls" in a "for" (http://mywiki.wooledge.org/ParsingLs)
I think you should use a fifo (see mkfifo)
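For what it's worth, here is a rough sketch of a different approach from the fifo one: iterate with a glob instead of ls, and let each runner "claim" a subject by creating its output folder with mkdir, which fails if the folder already exists. The names process_all and analyse are made up for illustration, and it assumes one output folder per subject named after the data file.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real analysis program:
# analyse INPUT_FILE OUTPUT_FOLDER
analyse() { echo "analysed $1" > "$2/result.txt"; }

# process_all DATA_DIR OUT_DIR
# Analyse every file in DATA_DIR once, even if several copies of this
# function run at the same time.
process_all() {
    local data_dir=$1 out_dir=$2 file name
    for file in "$data_dir"/*; do          # glob, not `ls`
        [ -f "$file" ] || continue
        name=$(basename "$file")
        # mkdir fails when the folder already exists, so only one
        # runner can claim a given subject.
        if mkdir "$out_dir/$name" 2>/dev/null; then
            analyse "$file" "$out_dir/$name"
        fi
    done
}
```

Running process_all /path/to/data /path/to/output in each terminal should then split the work between them. One caveat: if a run is interrupted, its half-finished folder still counts as claimed and has to be removed by hand.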

As a follow-on from the comments, you can install GNU Parallel with homebrew:
brew install parallel
Then your command becomes:
parallel analyse ::: *.dat
and it will process all your files in parallel using as many CPU cores as you have in your Mac. You can also add in:
parallel --dry-run analyse ::: *.dat
to get it to show you the commands it would run without actually running anything.
You can also add in --eta (Estimated Time of Arrival) for an estimate of when the jobs will be done, and -j 8 if you want to run, say, 8 jobs at a time. Of course, if you specifically want the 2 jobs at a time you asked for, use -j 2.
You can also have GNU Parallel simply distribute jobs and data to any other machines you may have available via ssh access.

Related

How to code for iterating through multiple files in linux?

I have code which I am trying to update from another example. The aim is to run plink using files of: each chromosome, snp ids, and a file containing only 1 ID which is an individual's ID. Running these files in plink ultimately makes a vcf file per individual for a given chromosome.
I have 22 chromosome files, 1 snp file (which is always the same), and 500 individual files. For each individual I am aiming to make a vcf for each chromosome, so I have 22*500 (11000) vcf files as output.
With doing this at the moment I have tried a bash script with this:
ID=$SGE_TASK_ID
indiv=$SGE_TASK_ID
plink --bed chr${ID}.bed --bim chr${ID}.bim --fam chr${ID}.fam --extract snps.txt \
--recode vcf-iid --out output${indiv}chr${ID}vcf --keep-fam individual${indiv}.txt
This runs, however it only runs through 1 individual, giving me 22 chromosome vcf files for that one person, and stops there. How do I make this run for all 500 people? Would it be with a for loop? Looking through other questions I haven't been able to find one that matches my question and is in linux. Any help would be appreciated.
${indiv} would just be a number, so the text file that runs looks like individual1.txt and increases through the 500 individuals (individual1.txt, individual2.txt, individual3.txt)
Assuming that ${indiv} contains no spaces,
for indiv in $(<individuals.data); do
plink [...] individual${indiv}.txt
done
The file individuals.data would name the individuals, separated by spaces or newlines.
If unsure what the Bash shell's $(<...) operator does, try this:
for A in $(<individuals.data); do
echo "[$A]"
done
Note that, as @Kaz has observed, if you wish your script to work also in shells other than Bash, then you might write $(cat ...) rather than $(<...).
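To cover every individual/chromosome pair in one script, the loop over individuals can be nested with a loop over chromosomes. The sketch below only prints the commands so the pairing can be checked first; make_vcfs is a hypothetical helper name, the plink flags are copied from the question as-is, and it assumes individuals and chromosomes are both numbered from 1.

```shell
#!/usr/bin/env bash
# make_vcfs N_INDIVIDUALS N_CHROMOSOMES
# Print (rather than run) one plink command per individual/chromosome pair.
# Remove the leading "echo" to actually run plink.
make_vcfs() {
    local n_indiv=$1 n_chr=$2 indiv chr
    for ((indiv = 1; indiv <= n_indiv; indiv++)); do
        for ((chr = 1; chr <= n_chr; chr++)); do
            echo plink --bed "chr${chr}.bed" --bim "chr${chr}.bim" \
                --fam "chr${chr}.fam" --extract snps.txt \
                --recode vcf-iid --out "output${indiv}chr${chr}vcf" \
                --keep-fam "individual${indiv}.txt"
        done
    done
}
```

make_vcfs 500 22 prints the 11000 commands for the case in the question.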

monitor linux processes for a long period of time and save it into a text file or csv file

I am running a stability test which involves several important processes. I want to be able to monitor those processes individually (CPU, memory, IO, etc.). I know I can use the top command, but it only shows live metrics and not an overall or average figure that I could turn into a graph to see how it behaved over time. How can I do that?
You can still use top, printing the output of a single instance to a file, then using grep to isolate the processes you want to see, and then using awk to select the fields you want.
Something like
top -n 1 -b > /tmp/log_top_running ; grep <process_name> /tmp/log_top_running | awk '{print $10}' >> <report_file>
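will extract one field for the matching process and append it to the report file. With the default procps-ng column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), field 10 is %MEM; use $9 for %CPU or $11 for TIME+, the accumulated CPU time. -b is to avoid escape chars in the file, -n 1 terminates top after the first refresh.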
This is the most basic thing you can do - you can probably do something smarter by passing to top the flags to only print the stuff you want to see.
To have it execute regularly you can write this command in a script and use the watch command, setting an interval in seconds with the -n option. After you have your file you can plot it.
Hope it helps.
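As a variation on the same idea, here is a rough sketch that samples one process at a fixed interval and writes CSV rows that can be plotted later. It uses ps instead of top, which is a swap on my part: ps lets you name the exact fields, but on Linux its pcpu value is averaged over the process lifetime rather than being a live reading. sample_process and the column names are made up for illustration.

```shell
#!/usr/bin/env bash
# sample_process PID COUNT INTERVAL_SECONDS REPORT_FILE
# Append COUNT rows of "epoch_seconds,pid,pcpu,vsz_kib" to REPORT_FILE.
sample_process() {
    local pid=$1 count=$2 interval=$3 report=$4 i stats cpu vsz
    # Write the header only when the report is new or empty.
    [ -s "$report" ] || echo "time,pid,pcpu,vsz_kib" > "$report"
    for ((i = 0; i < count; i++)); do
        # "pcpu=" and "vsz=" suppress the ps header line.
        stats=$(ps -o pcpu= -o vsz= -p "$pid") || break   # process is gone
        read -r cpu vsz <<< "$stats"
        echo "$(date +%s),$pid,$cpu,$vsz" >> "$report"
        sleep "$interval"
    done
}
```

For example, sample_process 1234 720 5 report.csv would take a sample every 5 seconds for an hour, where 1234 is a placeholder PID.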

Add comments next to files in Linux

I'm interested in simply adding a comment next to my files in Linux (Ubuntu). An example would be:
info user ... my_data.csv Raw data which was sent to me.
info user ... my_data_cleaned.csv Raw data with duplicates filtered.
info user ... my_data_top10.csv Cleaned data with only top 10 values selected for each ID.
So sort of the way you can comment commits in Git. I don't particularly care about searching on these tags, filtering them etc. Just seeings them when I list files in a directory. Bonus if the comments/tags follow the document around as I copy or move it.
Most filesystem types support extended attributes where you could store comments.
So for example to create a comment on "foo.file":
xattr -w user.comment "This is a comment" foo.file
The attributes can be copied/moved with the file. Just be aware that many utilities require special options to copy the extended attributes.
Then to list files with comments use a script or program that grabs the extended attribute. Here is a simple example to use as a starting point, it just lists the files in the current directory:
#!/bin/sh
ls -1 | while read -r FILE; do
comment=`xattr -p user.comment "$FILE" 2>/dev/null`
if [ -n "$comment" ]; then
echo "$FILE Comment: $comment"
else
echo "$FILE"
fi
done
The xattr command is really slow and poorly written (it doesn't even return error status) so I suggest something else if possible. Use setfattr and getfattr in a more complex script than what I have provided. Or maybe a custom ls command that is aware of the user.comment attribute.
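A minimal sketch of the setfattr/getfattr variant, assuming the attr package is installed and the filesystem allows user.* attributes. set_comment and list_comments are hypothetical names, and the glob skips dotfiles.

```shell
#!/usr/bin/env bash
# set_comment FILE TEXT
# Store TEXT in the user.comment extended attribute of FILE.
set_comment() { setfattr -n user.comment -v "$2" -- "$1"; }

# list_comments [DIR]
# Print each entry in DIR (default: current directory), followed by its
# comment if it has one.
list_comments() {
    local dir=${1:-.} file comment
    for file in "$dir"/*; do
        [ -e "$file" ] || continue
        # --only-values prints just the attribute value; errors (no such
        # attribute, unsupported filesystem) leave $comment empty.
        comment=$(getfattr --only-values -n user.comment -- "$file" 2>/dev/null)
        if [ -n "$comment" ]; then
            printf '%s\tComment: %s\n' "${file##*/}" "$comment"
        else
            printf '%s\n' "${file##*/}"
        fi
    done
}
```

Usage would look like set_comment my_data.csv "Raw data which was sent to me." followed by list_comments.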
This is a moderately serious challenge. Basically, you want to add attributes to files, keep the attributes when the file is copied or moved, and then modify ls to display the values of these attributes.
So, here's how I would attack the problem.
1) Store the information in a SQLite database. You can probably get away with one table. The table should contain the complete path to the file, and your comment. I'd name the database something like ~/.dirinfo/dirinfo.db. I'd store it in a subfolder, because you may find later on that you need other information in this folder. It'd be nice to use inodes rather than pathnames, but they change too frequently. Still, you might be able to do something where you store both the inode and the pathname, and retrieve by pathname only if the retrieval by inode fails, in which case you'd then update the inode information.
2) write a bash script to create/read/update/delete the comment for a given file.
3) Write another bash function or script that works with ls. I wouldn't call it "ls" though, because you don't want to mess with all the command line options that are available to ls. You're going to be calling ls always as ls -1 in your script, possibly with some sort options, such as -t and/or -r. Anyway, your script will call ls -1 and loop through the output, displaying the file name, and the comment, which you'll look up using the script from 2). You may also want to add file size, but that's up to you.
4) write functions to replace mv and cp (and ln??). These would be wrapper functions that would update the information in your table, and then call the regular Unix versions of these commands, passing along any arguments received by the functions (i.e. "$@"). If you're really paranoid, you'd also do it for things like scp, which can be used (inefficiently) to copy files locally. Still, it's unlikely you'll catch all the possibilities. What if someone else does a mv on your file, who doesn't have the function you have? What if some script moves the file by calling /bin/mv? You can't easily get around these kinds of issues.
Or if you really wanted to get adventurous, you'd write some C/C++ code to do this. It'd be faster, and honestly not all that much more challenging, provided you understand fork() and exec(). I can't recall whether sqlite has a C API. I assume it does. You'd have to tangle with that, too, but since you only have one database, and one table, that shouldn't be too challenging.
You could do it in perl, too, but I'm not sure that it would be that much easier in perl, than in bash. Your actual code isn't that complex, and you're not likely to be doing any crazy regex stuff or string manipulations. There are just lots of small pieces to fit together.
Doing all of this is much more work than should be expected for a person answering a question here, but I've given you the overall design. Implementing it should be relatively easy if you follow the design above and can live with the constraints.

Does inotifywait on Linux allow collection of events over a timeout period?

cf. FSEvents on OSX, which by default collects FS events over 1 second (timeout configurable) before firing off the event.
This has the benefit of collecting a series of filesystem changes into a single event (so the script won't run more than it needs to), at the cost of latency.
For instance, saving a file in Vim modifies many temp files (it tends to delete a buffer file, update an undo file, and also creates and then erases a test file called 4193) in addition to the file itself. On OSX with a small tool that uses this API such as my fork of fswatch, all of these can get collapsed into one "batch event", whereas with inotifywait -m all the events that I specify come over the stream in separate lines making it not simple to group without external processing.
I'm pretty sure the solution is to just to wrap it and do this processing but I was hoping there was a hidden feature to specify a timeout like the FSEvents allows for.
I actually am starting to believe that this sort of thing should not be within the scope of inotify's features.
I haven't quite found the proper solution, but it looks to me like there is some elegant way to do it. Here's my starting point (which quits if nothing is seen in a second; I want something that accumulates events over one second).
Currently doing some testing with this. Here's some test scripting I've got working quite well.
group=0
( for val in {1..10}; do echo "$RANDOM/10000" | bc | xargs sleep; echo $val; done ) | while true; do while read -t 1 line; do echo "read $group $line"; done; ((group++)); done
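Building on the read -t idea, here is a rough sketch of a wrapper that blocks until the first line arrives, keeps collecting until the stream has been quiet for a given number of seconds, and then calls a handler once per batch. batch_lines and the handler convention are made up for illustration. Note that this batches on an idle gap rather than a fixed window, so a stream that never pauses would never fire.

```shell
#!/usr/bin/env bash
# batch_lines IDLE_SECONDS HANDLER
# Read lines from stdin and call HANDLER once per batch, passing the
# collected lines as arguments. A batch ends when no new line arrives
# for IDLE_SECONDS, or at end of input.
batch_lines() {
    local idle=$1 handler=$2 line batch
    # Wait without a timeout for the first event of a batch.
    while IFS= read -r line; do
        batch=("$line")
        # Keep collecting until the stream goes quiet.
        while IFS= read -r -t "$idle" line; do
            batch+=("$line")
        done
        "$handler" "${batch[@]}"
    done
}
```

With inotifywait this could be used as inotifywait -m -r -e modify,create,delete --format '%w%f' /path/to/folder | batch_lines 1 my_handler, where my_handler is whatever function or script should run once per batch.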
I implemented https://github.com/bronger/watchdog, which may help people with this use case. “watchdog” accumulates events before firing. Moreover, it bundles equivalent events (e.g. multiple writes to the same file, or deleting a file immediately after changing it). When firing, it calls one of three scripts: “copy” (one file was changed), “delete” (one file/directory was deleted), or “bulk_sync” (anything else). The watchdog proceeds with collecting events even while the script is running so that nothing gets lost.
I wrote it for efficient synchronisation of local changes with a remote computer. But I myself also use it for other things by just symlinking all three scripts to the same one.
I had a similar problem recently and wanted to try and stay light on dependencies, so came up with the script below. inotifywait emits all events that happen in your watched directory, but you can format its output. So, I just format the output as the event's Unix timestamp and compare that to a timer to cap how frequently the desired "sync" command runs.
#!/usr/bin/env bash
set -e # exit on errors
# Batch changes every 15s.
next_allowed_run=$(date +%s)
batch_window_seconds=15
inotifywait \
--monitor /path/to/folder \
--recursive \
--event=create \
--event=modify \
--event=attrib \
--event=delete \
--event=move \
--format='%T' \
--timefmt='%s' |
while read event_time; do
# If events arrive before the next allowed command-run, just skip them.
if [[ $event_time -ge $next_allowed_run ]]; then
next_allowed_run=$(date --date="${batch_window_seconds}sec" +%s)
sleep $batch_window_seconds # Wait for additional changes
foobobulate /path/to/folder
fi
done

Redirect program output without changing directory

Problem
I'm writing a set of scripts to help with automated batch job execution on a cluster.
The specific thing I have is a $OUTPUT_DIR, and an arbitrary $COMMAND.
I would like to execute the $COMMAND such that its output ends up in $OUTPUT_DIR.
For example, if COMMAND='cp ./foo ./bar; mv ./bar ./baz', I would like to run it such that the end result is equivalent to cp ./foo ./$OUTPUT_DIR/baz.
Ideally, the solution would look something like eval PWD="./$OUTPUT_DIR" $COMMAND, but that doesn't work.
Known solutions
[And their problems]
Editing $COMMAND: In most cases the command will be a script, or a compiled C or FORTRAN executable. Changing the internals of these isn't an option.
unionfs, aufs, etc.: While this is basically perfect, users running this won't have root, and causing thousands+ of arbitrary mounts seems like a questionable choice.
copying/ hard/soft links: This might be the solution I will have to use: some variety of actually duplicating the entire content of ./ into ./$OUTPUT_DIR
cd $OUTPUT_DIR; ../$COMMAND : Fails if $COMMAND ever reads files
pipes : only works if $COMMAND doesn't directly work with files; which it usually does
Is there another solution that I'm missing, or is this request actually impossible?
[EDIT:]Chosen Solution
I'm going to go with something where each object in the directory is symbolic-linked into the output directory, and the command is then run from there.
This has the downside of creating a lot of symbolic links, but it shouldn't be too bad.
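A minimal sketch of what that could look like, assuming the command is happy to read its inputs through symbolic links. run_in_output_dir is a hypothetical name, the glob skips dotfiles, and a command that overwrites an input file in place would still modify the original through the link.

```shell
#!/usr/bin/env bash
# run_in_output_dir OUTPUT_DIR COMMAND [ARGS...]
# Link every entry of the current directory into OUTPUT_DIR, then run
# the command with OUTPUT_DIR as its working directory.
run_in_output_dir() {
    local out=$1 src=$PWD entry
    shift
    mkdir -p "$out" || return
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue
        # Skip the output directory itself to avoid a link back into it.
        [ "$entry" -ef "$out" ] && continue
        # Names that already exist in OUTPUT_DIR are left alone.
        ln -s "$entry" "$out/" 2>/dev/null
    done
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$out" && "$@" )
}
```

A compound command like the one in the example has to go through a shell: run_in_output_dir "$OUTPUT_DIR" bash -c 'cp ./foo ./bar; mv ./bar ./baz'.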
You can't solve this without making some assumptions about the interface of $COMMAND. There is no single definition of what "output ends up in $OUTPUT_DIR" means. For one program this may be some files, but another program might just print something to stdout and yet another might try sending some data over the internet using some protocol or display something in a GUI and there isn't an obvious way of mapping all of these to "output goes to $OUTPUT_DIR".
So, you need to invent some assumptions and require any $COMMAND implementation to follow them. Then, it may get as simple as requesting that the command accept a parameter such as --target=<DIR>. If your command was some simple command, you would have to create a wrapper script around it to translate that parameter into what the app accepts. GNU cp, mv and a few more utils already accept --target-directory=<DIR> (short form -t), so that may be a good starting point.
You cannot set the output directory, you can only set the working directory.
The problem is, once you set the working directory, other relative references become invalid. For example, in your command, ./foo would no longer resolve to the input file:
cp ./foo ./bar
If you have a specific command, there are workarounds (creating a script that alters arguments, prepending the directory to specific arguments), but in general this is not possible.
