Bash: Loop through file and read substring as argument, execute multiple instances - linux

How it is now
I currently have a script running under Windows that frequently retrieves recursive file listings from a list of servers.
I use an AutoIt (job manager) script to execute 30 parallel instances of lftp (still on Windows), doing this:
lftp -e "find .; exit" <serveraddr>
The file used as input for the job manager is a plain text file and each line is formatted like this:
<serveraddr>|...
where "..." is unimportant data. I need to run multiple instances of lftp in order to achieve maximum performance, because single instance performance is determined by the response time of the server.
Each lftp.exe instance pipes its output to a file named
<serveraddr>.txt
How it needs to be
Now I need to port this whole thing over to a dedicated Linux server (Ubuntu, with lftp installed). From my previous, very(!) limited experience with Linux, I guess this will be quite simple.
What do I need to write and with what? For example, do I still need a job manager script or can this be done in a single script? How do I read from the file (I guess this will be the easy part), and how do I keep a maximum of 30 instances running (maybe even with a timeout, because extremely unresponsive servers can clog the queue)?
Thanks!

Parallel processing
I'd use GNU parallel. It isn't installed by default, but it is available in the default package repositories of most Linux distributions. It works like this:
parallel echo ::: arg1 arg2
will execute echo arg1 and echo arg2 in parallel.
So the easiest approach is to create a script that handles a single server in bash/perl/python - whatever suits your fancy - and execute it like this:
parallel ./script ::: server1 server2
The script could look like this:
#!/bin/sh
# $0 holds the program name, $1 holds the first argument.
# $1 will get passed from GNU parallel; we save it to a variable.
server="$1"
lftp -e "find .; exit" "$server" >"$server-files.txt"
lftp seems to be available for Linux as well, so you don't need to change the FTP client.
To run at most 30 instances at a time, pass -j30, like this: parallel -j30 echo ::: 1 2 3
Reading the file list
Now how do you transform a specification file containing <server>|... entries into GNU parallel arguments? Easy - first, filter the file down to just the host names:
sed 's/|.*$//' server-list.txt
sed is used to replace text using regular expressions, among other things. This will strip everything (.*) from the first | up to the line end ($). (While | normally means the alternation operator in regular expressions, in sed it needs to be escaped to work like that; otherwise it matches just a plain |.)
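For example (the host name and the trailing data are just placeholder values):
printf '%s\n' 'ftp.example.com|some unimportant data' | sed 's/|.*$//'
# prints: ftp.example.com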
So now you have a list of servers. How do you pass them to your script? With xargs! xargs will append each input line as an additional argument to your executable. For example
echo -e "1\n2"|xargs echo fixed_argument
will run
echo fixed_argument 1 2
So in your case you should do
sed 's/|.*$//' server-list.txt | xargs parallel -j30 ./script :::
Caveats
Be sure not to save the results to the same file in each parallel task, otherwise the file will get corrupted - the standard tools don't implement any locking mechanisms unless you add them yourself. That's why I redirected the output to $server-files.txt rather than a single files.txt.
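Putting it together, and addressing the timeout concern from the question, a minimal sketch (assuming the wrapper script above is saved as ./script, the list lives in server-list.txt, and a 300-second limit per server is acceptable) might be:
sed 's/|.*$//' server-list.txt | xargs parallel -j30 --timeout 300 ./script :::
GNU parallel's --timeout kills any job that runs longer than the given number of seconds, so an extremely unresponsive server can't clog the queue.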

Related

Proxying commands across an SSH connection

I'd like to be able to "spoof" certain commands on my machine, actually invoking them on a remote system. For example, whenever I run:
cmd options
I'd like the actual command to be:
ssh user@host cmd options
Ideally I'd like to have a folder called spoof, add it to my PATH, and have an executable in there called cmd which does the spoofing. If I have a lot of commands, this could get tedious. Anyone have ideas of a good way to go about this? Such that I can add and remove a lot of commands in the future? And, I'd like to be able to pass all the arguments exactly (or as exact as possible) and every single command I want to spoof would just have the ssh user@host in front of it.
The reason for this is I'm running a container (specifically Singularity) on my machine, and there are certain commands I don't really want to containerize, but still want to run from within the container. I've found I can get the functionality I want by just putting the ssh invocation in front of it. Examples are sbatch and matlab, which are a pain to containerize, and I'm fine with just using ssh to call them. Files that these programs use are written to a bind point so the host machine can see them just fine.
The following script can be hardlinked under all the names of commands you wish to transparently proxy:
#!/usr/bin/env bash
printf -v str '%q ' "${0##*/}" "$@"
ssh host "$str"
SSH combines all its arguments into a single string, which is then executed by a remote shell. To ensure that the remote arguments are identical to the local ones, the values need to be escaped; otherwise, somecommand "hello world" and somecommand "hello" "world" could be represented identically over the wire.
In an appropriately extended printf (including both the bash and ksh implementations), %q is replaced with an escaped form of the corresponding value, which will evaluate back to the original (literal) text if interpreted by a shell later.
printf -v varname stores the output of printf in a variable named varname without the overhead/inefficiency of a command substitution. (In ksh93, varname=$(printf ...) is optimized to skip the subshell overhead, so this is not necessary there.)
$0 evaluates to argv[0], which is by convention the name of the command currently being run. (This can be overridden, but you trust your users to behave reasonably... right?)
${0##*/} is a parameter expansion which returns only content after the last / in $0 (should it in fact contain any slashes; otherwise, the original value is used unmodified).
"$#" refers to the exact argument vector passed to your script.

cd && ls | grep: How to execute a command in the current shell and pass the output

I created an alias in order not to write ls every time I move into a new directory:
alias cl='cd_(){ cd "$@" && ls; }; cd_'
Let us say I have a folder named "Downloads" (which of course I happen to have) so I just type the following in the terminal:
cl Downloads
Now I will find myself in the "Downloads" folder and receive a list of the stuff I have in the folder, like say: example.txt, hack.hs, picture.jpg,...
If I want to move to a directory and look if there is, say, hack.hs I could try something like this:
cl Downloads | grep hack
What I get is just the output:
hack.hs
But I will remain in the folder I was (which means I am not in Downloads).
I understand this happens because every command in a pipeline is executed in a subshell, and thus cd Downloads && ls is executed in a subshell of its own and then the output (namely the list of stuff I have) gets redirected via the pipe to grep. This is why I am then not in the new folder.
My question is the following:
How do I do it in order to be able to write something like "cl Downloads | grep hack" and get the "hack"-greped list of stuff AND be in the Downloads folder?
Thank you very much,
Pol
For anyone ever googling this:
A quick fix was proposed by @gniourf_gniourf:
cl Downloads > >(grep hack)
Some marked this question as a possible duplicate of Make a Bash alias that takes a parameter, but the fact that my bash alias already takes arguments shows that this is not the case. The problem at hand was about how to execute a command in the current shell while at the same time redirecting the output to another command.
As you're aware (and as is covered in BashFAQ #24), the reason
{ cd "$#" && ls; } | grep ...
...prevents the results of cd being visible in the outer shell is that no component of a pipeline is guaranteed by POSIX to be run in the outer shell. (Some shells, including ksh [out-of-the-box] and very modern bash with non-default options enabled, will occasionally or optionally run the last piece of a pipeline in the parent shell, but this can't portably be relied on).
A way to avoid this, that's applicable to all POSIX shells, is to direct output to a named pipe, thus avoiding setting up a pipeline:
mkfifo mypipe
grep ... <mypipe &
{ cd "$#" && ls; } >mypipe
In modern ksh and bash, there's a shorter syntax that will do this for you -- using /dev/fd entries instead of setting up a named pipe if the operating system provides that facility:
{ cd "$#" && ls; } > >(grep ...)
In this case, >(grep ...) is replaced with a filename that points to either a FIFO or a /dev/fd entry that, when written to by the process in question, redirects output to grep -- but without a pipeline.
By the way -- I really do hope your use of ls in this manner is just an example. The output of ls is not well-specified for the range of all possible filenames, so grepping it is innately unreliable. Consider using printf '%s\0' * to emit a NUL-delimited list of non-hidden names in a directory, if you really do want to build a streamed result; or using glob expressions to check for files matching a specific pattern (BashFAQ #4 covers a similar scenario); extglobs are available if you need something closer to full regex matching than POSIX patterns provide.
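For example, a hedged sketch of the safer alternatives (grep -z is a GNU extension; 'hack' is just the pattern from the question):
printf '%s\0' * | grep -z 'hack' | tr '\0' '\n'                # NUL-delimited stream, made printable
for f in *hack*; do [ -e "$f" ] && printf '%s\n' "$f"; done    # or test the glob directly, no parsing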

Run two shell script in parallel and capture their output

I want to have a shell script which configures several things and then calls two other shell scripts. I want these two scripts to run in parallel, and I want to be able to get and print their live output.
Here is my first script which calls the other two
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh
$path/instance1_commands.sh
These two processes try to deploy two different applications, and each of them takes around 5 minutes, so I want to run them in parallel and also see their live output so I know where they are with the deployment tasks. Is this possible?
Running both scripts in parallel can look like this:
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh >instance2.out 2>&1 &
$path/instance1_commands.sh >instance1.out 2>&1 &
wait
Notes:
wait pauses until the children, instance1 and instance2, finish
2>&1 on each line redirects error messages to the relevant output file
& at the end of a line causes the main script to continue running after forking, thereby producing a child that is executing that line of the script concurrently with the rest of the main script
each script should send its output to a separate file. Sending both to the same file will be visually messy and impossible to sort out when the instances generate similar output messages.
you may attempt to read the output files while the scripts are running with any reader, e.g. less instance1.out, but the output may be stuck in a buffer and not up to date. To fix that, the programs would have to open stdout in line-buffered or unbuffered mode. It is also up to you to refresh the display, e.g. with tail -f or by pressing > in less to jump to the end of the file; see the sketch after these notes.
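A minimal sketch of forcing line-buffered output so the files stay reasonably current (stdbuf is from GNU coreutils and affects ordinary dynamically linked programs using C stdio; the file names are the same ones used above):
stdbuf -oL "$path/instance1_commands.sh" >instance1.out 2>&1 &
pid1=$!
stdbuf -oL "$path/instance2_commands.sh" >instance2.out 2>&1 &
pid2=$!
tail -f instance1.out instance2.out &   # follow both logs live
tailpid=$!
wait "$pid1" "$pid2"                    # wait for the two deploy scripts
kill "$tailpid"                         # stop following once they finish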
Example D from an article on Apache Spark and parallel processing on my blog provides a similar shell script for calculating sums of a series for Pi on all cores, given a C program for calculating the sum on one core. This is a bit beyond the scope of the question, but I mention it in case you'd like to see a deeper example.
It is very possible, change your script to look like this:
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh >> script.log &
$path/instance1_commands.sh >> script.log &
wait
They will both output to the same file and you can watch that file by running:
tail -f script.log
If you wish, you can output to two different files instead. Just change each line to redirect (>>) to a different file name.
This is how I ended up writing it, using Paul's instructions.
source $path/instance2_commands.sh >instance2.out 2>&1 &
source $path/instance1_commands.sh >instance1.out 2>&1 &
tail -q -f instance1.out -f instance2.out --pid $!
wait
sudo rm instance1.out
sudo rm instance2.out
The logs from my two processes were different, so I didn't care whether they stayed separate; that is why I view them all together in one stream.

Stream specific numbered Bash file descriptor into variable

I am trying to stream a specific numbered file descriptor into a variable in Bash. I can do this from normal standard input using the following function, but how do I do it from a specific file descriptor? I would need to direct the FD into the sub-shell if I use the same approach. I could always do it by reading line by line, but if I can do it in a continuous stream then that would be massively preferable.
The function I have is:
streamStdInTo ()
{
    local STORE_INvar="${1}" ; shift
    printf -v "${STORE_INvar}" '%s' "$( cat - )"
}
Yes, I know that this wouldn't work normally, as the end of a pipeline would be lost (due to its execution in a sub-shell); however, either in the context of the Bash 4 set +m ; shopt -s lastpipe method of executing the end of a pipeline in the same shell as the start, or by directing into this via a different file descriptor, I am hoping to be able to use it.
So, my question is, How do I use the above but with different file descriptors than the normal?
It's not entirely clear what you mean, but perhaps you are looking for something like:
cat - <&4 # read from fd 4
Or, just call your current function with the redirect:
streamStdInTo foo <&4
edit:
Addressing some questions from the comment, you can use a fifo:
#!/bin/bash
trap 'rm -f "$f"' 0        # clean up the FIFO on exit
f=$(mktemp fifo.XXXXXX)    # mktemp needs at least three X's in its template
rm "$f"                    # replace the regular temp file with a FIFO of the same name
mkfifo "$f"
echo foo > "$f" &          # background writer
exec 4< "$f"               # open the FIFO for reading on fd 4
cat - <&4                  # read from fd 4
wait
I think there's a lot of confusion about what exactly you're trying to do. If I understand correctly the end goal here is to run a pipeline and capture the output in a variable, right? Kind of like this:
var=$(cmd1 | cmd2)
Except I guess the idea here is that the name of "$var" is stored in another variable:
varname=var
You can do an end-run around Bash's usual job control situation by using process substitution. So instead of this normal pipeline (which would work in ksh or zsh, but not in bash unless you set lastpipe):
cmd1 | cmd2 | read "$varname"
You would use this command, which is equivalent apart from how the shell handles the job:
read "$varname" < <(cmd1 | cmd2)
With process substitution, "read $varname" isn't run in a pipeline, so Bash doesn't fork to run it. (You could use your streamStdInTo() function there as well, of course)
As I understand it, you wanted to solve this problem by using numeric file descriptors:
cmd1 | cmd2 >&$fd1 &
read "$varname" <&$fd2
To create those file descriptors that connect the pipeline background job to the "read" command, what you need is called a pipe, or a fifo. These can be created without touching the file system (the shell does it all the time!) but the shell doesn't directly expose this functionality, which is why we need to resort to mkfifo to create a named pipe. A named pipe is a special file that exists on the filesystem, but the data you write to it doesn't go to the disk. It's a data queue stored in memory (a pipe). It doesn't need to stay on the filesystem after you've opened it, either, it can be deleted almost immediately:
pipedir=$(mktemp -d /tmp/pipe_maker_XXXX)
mkfifo ${pipedir}/pipe
exec {temp_fd}<>${pipedir}/pipe # Open both ends of the pipe
exec {fd1}>${pipedir}/pipe
exec {fd2}<${pipedir}/pipe
exec {temp_fd}<&- # Close the read/write FD
rm -rf ${pipedir} # Don't need the named FIFO any more
One of the difficulties in working with named pipes in the shell is that attempting to open them just for reading, or just for writing causes the call to block until something opens the other end of the pipe. You can get around that by opening one end in a background job before trying to open the other end, or by opening both ends at once as I did above.
The "{fd}<..." syntax dynamically assigns an unused file descriptor number to the variable $fd and opens the file on that file descriptor. It's been around in ksh for ages (since 1993?), but in Bash I think it only goes back to 4.1 (from 2010).

Shell Script - Linux

I want to write a very simple script which takes a process name and returns the tail of the last file whose name contains the process name.
I wrote something like this:
#!/bin/sh
tail $(ls -t *"$1"*| head -1) -f
My question:
Do I need the first line?
Why isn't ls -t *"$1"*| head -1 | tail -f working?
Is there a better way to do it?
1: The first line is a so-called she-bang; read the description here:
In computing, a shebang (also called a hashbang, hashpling, pound bang, or crunchbang) refers to the characters "#!" when they are the first two characters in an interpreter directive as the first line of a text file. In a Unix-like operating system, the program loader takes the presence of these two characters as an indication that the file is a script, and tries to execute that script using the interpreter specified by the rest of the first line in the file.
2: tail can't take the filename from stdin: it can either read the text to process from stdin or take a file as a parameter. See its man page.
3: No better solution comes to mind. Pay attention to filenames containing spaces: these do not work with your current solution; you need to add quotes around the $() block, as in the sketch below.
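For example, a quoted version might look like this (still subject to the general caveat that parsing ls output breaks on filenames containing newlines):
tail -f "$(ls -t -- *"$1"* | head -n 1)"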
$1 contains the first argument; the script's own process name is actually in $0. This, however, can contain the path, so you should use:
#!/bin/sh
tail $(ls -rt *"`basename $0`"*| head -1) -f
You also have to use ls -rt to get the oldest file first.
You can omit the shebang if you run the script from a shell, in that case the contents will be executed by your current shell instance. In many cases this will cause no problems, but it is still a bad practice.
Following on from @theomega's answer and @Idan's question in the comments, the she-bang is needed, among other things, because some UNIX / Linux systems have more than one command shell.
Each command shell has a different syntax, so the she-bang provides a way to specify which shell should be used to execute the script, even if you don't specify it in your run command by typing (for example)
./myscript.sh
instead of
/bin/sh ./myscript.sh
Note that the she-bang can also be used in scripts written in non-shell languages such as Perl; in that case you'd put
#!/usr/bin/perl
at the top of your script.
