How to specify multiple inputs as a single input on the Linux command line?

I searched online but I didn't find anything that could answer my question.
I'm using a java tool in Ubuntu Linux, calling it from bash; the tool takes two paths for two different input files:
java -Xmx8G -jar picard.jar FastqToSam \
FASTQ=6484_snippet_1.fastq \ #first read file of pair
FASTQ2=6484_snippet_2.fastq \ #second read file of pair
[...]
What I'd like to do is, for example, instead of specifying the path of a single file for FASTQ, specify the paths of two different files.
So instead of running cat file1 file2 > File first and passing File as the FASTQ input, I'd like that concatenation to happen on the fly, producing the combined input without ever saving it to the file system (which is what cat file1 file2 > File would do).
I hope I've been clear in explaining my question; if not, just ask and I'll try to explain better.

Most well-written shell commands which accept a file name argument also accept a list of file name arguments, like cat file or cat file1 file2, etc.
If the program you are trying to use doesn't support this, and cannot easily be fixed, perhaps your OS or shell makes /dev/stdin available as a pseudo-file.
cat file1 file2 | java -mumble -crash -burn FASTQ=/dev/stdin
Some shells also have process substitutions, which (typically) look to the calling program like a single file containing whatever the process substitution produces on standard output.
java -mumble -crash -burn FASTQ=<(cat file1 file2) FASTQ2=<(cat file3 file4)
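Applied to the picard command from the question, that might look roughly like the sketch below; the names of the second files of each pair are made up for illustration, and whether FastqToSam accepts a non-seekable input like this is something to verify.
java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=<(cat 6484_snippet_1.fastq other_snippet_1.fastq) \
    FASTQ2=<(cat 6484_snippet_2.fastq other_snippet_2.fastq) \
    [...]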
If neither of these work, a simple shell script which uses temporary files and deletes them when it's done is a tried and true solution.
#!/bin/sh
# Require at least four file name arguments; they are processed pairwise.
: "${4?Need four file name arguments, will process them pairwise}"
t=$(mktemp -d -t fastqtwoness.XXXXXXX) || exit
trap 'rm -rf "$t"' EXIT HUP INT TERM  # remove in case of failure or when done
cat "$1" "$2" >"$t/1.fastq"  # concatenate the first pair
cat "$3" "$4" >"$t/2.fastq"  # concatenate the second pair
exec java -mumble -crash -burn FASTQ="$t/1.fastq" FASTQ2="$t/2.fastq"
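For instance, if the script above were saved as concat_and_run.sh (the name is only illustrative), the two pairs of read files could be concatenated and processed in one invocation:
# Hypothetical invocation: the four file names are placeholders for your
# two pairs of read files, concatenated pairwise by the script.
./concat_and_run.sh reads_1a.fastq reads_1b.fastq reads_2a.fastq reads_2b.fastq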

Related

Assigning program output to a variable in shell script

I have a tool (written in C) that takes an output file parameter to which the tool writes some output string.
tool -o output-file-name
I would like to invoke the tool from a shell script and have the output string assigned to a variable.
I tried:
var=$(tool -o a.txt 1>/dev/null;cat a.txt && rm a.txt)
The above works, but I would like a more elegant solution.
P.S: I am far from a scripting guru.
You can start the tool without the -o option:
tool
or, like William Pursell wrote, with /dev/stdout as the file:
tool -o /dev/stdout
or in short form:
tool -o -
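Whichever form the tool supports, the output can then be captured directly with command substitution; a minimal sketch, assuming the tool prints nothing else on stdout:
var=$(tool -o /dev/stdout)   # or: var=$(tool -o -)
echo "$var"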
I'm assuming it is not possible to change the code to support output to stdout, probably because the code sends additional information (logging) to stdout.
You can use IO redirection or process substitution to eliminate the need to save and re-read the data from a file. The >(cat) construct tells bash to create a pipe between tool and cat (which simply prints whatever it reads).
# PREFERRED
# IO redirection, using an extra file descriptor (#3).
var=$(tool -o /dev/fd/3 3>&1 1>/dev/null)
# Process substitution, using an extra 'cat' process.
var=$(tool -o >(cat) >/dev/null)
The first solution is more efficient (no extra cat process running); the second is slightly more compact.
Since you're on linux, you can use tool -o /dev/stdout

Pass each file obtained from a command to another command as a parameter

I am using the following line to take a pdf and split it:
pdfseparate -f 14 -l 23 ALF.SS.0.pdf "${FILE}"-%d.pdf
Now I want for each file produced, to run several commands like this:
pdfcrop --margins '-30 0 -385 0' outputOfpdfSeparate outputOfpdfSeparate-1stCol.pdf
I am trying to figure out the best way to do this:
With a single loop: for each file created by pdfseparate, if I manage to "know" the name of the file, I could pass it to pdfcrop and be done. But since it uses %d, I do not know how to handle this "new name" in which each file gets a new number. I know how to do this in Java, but here I do not see it so clearly.
Using pipes. I think I have the same issue, since if I do
pdfseparate [options] | pdfcrop inputfile outputfile,
I do not know how to "use" the name of inputfile. I am sure it is easy but I don't see it.
Using xargs. I am studying this command since it is new for me.
Using exec. I am under the impression this is not necessary but maybe I am wrong since it's been a long while since I last used exec.
Thanks in advance.
You can use xargs. It is the best way in terms of speed.
I usually use it for converting a lot of .mp4 files to .mp3.
Doing this conversion one by one is not only tedious but also takes a long time. Therefore you can use the automatic parallelism offered by the -P 0 option of xargs.
For example, if I had 10 .mp4 files I would do this:
ls *.mp4 | xargs -I xxx -P 0 ffmpeg -i xxx xxx.mp3
After running this line, 10 ffmpeg commands are running simultaneously.
The other way to do this is to store a list of .mp4 files in a text file, like this:
ls *.mp4 > list-mp4
then:
xargs -I xxx -P 0 ffmpeg -i xxx xxx.mp3 < list-mp4
Or you may have access to GNU parallel, in which case you can do:
parallel ffmpeg -i {} {}.mp3 ::: *.mp4
Now for your case: if you want to use these (xargs or parallel) or your own command, note that your first command must send its output to stdout, because the second command is going to read its stdin from the stdout of the first command; bash wires this up for you.
Thus you can only use a pipe (|) with pdfseparate if it sends its output to stdout. If it does not, the right-hand side of the pipe (the second command) receives nothing, and conversely the second command must be able to read its input from the incoming stdout.
For example
ls *.txt | echo {}
here echo does not read any incoming stdout from the ls command and just prints {}
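By contrast, a minimal sketch of the same pipeline with xargs as the adapter, so that the names coming from ls actually reach the second command:
ls *.txt | xargs -I {} echo {}   # each file name is substituted for {}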
Eventually, if your pdfseparate sends its file names to stdout, xargs stores each one in the placeholder you name with -I and then calls your second command.
Therefore:
pdfseparate options... | xargs -I ABC -P 0 your-second-command+its-options ABC
NOTE 1: xargs stores the incoming stdout line by line in ABC and passes it to your second command as its input.
NOTE 2: you do not have to use -P 0 at all; it only speeds up execution. If you omit it, your second command runs once per incoming line, sequentially.
pdfseparate does not print the names of the files it creates, so you have to use the ls command to get the list of files you want to operate on.
# separate the pdfs
pdfseparate -f 14 -l 23 ALF.SS.0.pdf "${FILE}"-%d.pdf
# operate on the just created files, assumes that a "FILE" variable is set, which might not be the case
for i in $(ls "${FILE}"-*.pdf); do pdfcrop --margins '-30 0 -385 0' "$i"; done
# assuming that the FILE variable in your case would match ALF.SS.0-[0-9]*.pdf, you'd use this:
for i in $(ls ALF.SS.0-[0-9]*.pdf); do pdfcrop --margins '-30 0 -385 0' "$i"; done
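If you prefer xargs over the loop, the same crop step could be written roughly like this (a sketch; it assumes the generated file names contain no spaces):
ls ALF.SS.0-[0-9]*.pdf | xargs -I ABC pdfcrop --margins '-30 0 -385 0' ABC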

Run the cat command for all the files in the directory given as an argument to the script, and output to the name given as the second argument

I run the following bash code to concatenate the files in the directory given as the argument to the script:
for i in $*
do
cat $* > /home/christy/Documents/filetest/catted.txt
done
This produces the error
cat: /home/christy/Documents/filetest/catted.txt: input file is output file
I think there are at least 4 things wrong with your script....
Firstly, your loop will set the value of i to the name of each file in succession, so you would want to actually use i inside your loop, like this:
for i in $*
do
cat "$i" ....somewhere
done
Secondly, if you use the > redirection, each file will land exactly on top of the previous one, so you should really use the >> redirection, which will append the current file to the end of the previous one, like this:
for i in $*
do
cat "$i" >> ...somewhere
done
Thirdly, I think you should use double-quoted "$@" to get all your command-line arguments, rather than plain $*
for i in "$@"
...
Fourthly, you can achieve the exact effect I think you want with this simpler command:
cat "$#" > /home/christy/Documents/filetest/catted.txt
You can't cat a file back onto itself. That's what "input file is output file" means. Because catted.txt shows up in your list of arguments to cat, it is going to try to cat to itself. So, move catted.txt to somewhere other than the source directory.
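Putting those points together, a minimal corrected script might look like the sketch below; the output path is the one from the question and is assumed to lie outside the source directory:
#!/bin/sh
# Concatenate every file given on the command line into one output file.
# The output lives outside the source directory, so cat never reads it back.
cat "$@" > /home/christy/Documents/filetest/catted.txt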

Bash Sorting Redirection

What are the differences between sort file1 -o file2 and sort file1 > file2? So far, from what I have done, they do the same thing, but perhaps I'm missing something.
The following two commands are similar as long as file1 and file2 are different files.
sort file1 -o file2 # Output redirection within sort command
sort file1 > file2 # Output redirection via shell
Let's see what happens when the input and output are the same file, i.e. you try to sort in place:
sort file -o file # Works perfectly fine and does in-place sorting
sort file > file # Surprise! Generates empty file. Data is lost :(
In summary, the above two redirection methods are similar but not the same.
Test
$ cat file
2
5
1
4
3
$ sort file -o file
$ cat file
1
2
3
4
5
$ sort file > file
$ cat file
$ ls -s file
0 file
The result is the same, but in the case of -o file2 the resulting file is created by sort directly, while in the other case it is created by bash and filled with the standard output of sort. The xfopen defined in line 450 of sort.c in coreutils treats both cases (stdout and -o filename) equally.
Redirecting the standard output of sort is more generic, as it can be redirected to another program with a | in place of a >, which the -o option makes more difficult to do (but not impossible).
The -o option is handy for in-place sorting, as redirecting to the same file would lead to a truncated file, because the file is created (and truncated) by the shell prior to the invocation of sort.
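Both points can be illustrated briefly; the file names are placeholders:
sort file1 | uniq > file2   # stdout redirection composes with a pipeline
sort file1 -o file1         # -o lets sort rewrite the same file in place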
There is not much difference: > is the standard unix output redirection operator, that is to say, 'write the output you would otherwise display on the terminal to the given file'. The -o option is specific to sort; it is another way of saying 'write the output to this given file'.
The > can be used where a tool does not have a specific write-to-file argument or option.

How to extract substring from a text file in bash?

I have lots of strings in a text file, like this:
"/home/mossen/Desktop/jeff's project/Results/FCCY.png"
"/tmp/accept/FLWS14UU.png"
"/home/tten/Desktop/.wordi/STSMLC.png"
I want to get only the file names from the string as I read the text file line by line, using a bash shell script. The file name will always end in .png and will always have the "/" in front of it. I can get each string into a var, but what is the best way to extract the filenames (FCCY.png, FLWS14UU.png, etc.) into vars? I can't count on the user having Perl, Python, etc, just the standard Unix utils such as awk and sed.
Thanks,
mossen
You want basename:
$ basename /tmp/accept/FLWS14UU.png
FLWS14UU.png
basename works on one file/string at a time. If you have many strings you will be iterating over the file and calling the external command many times.
use awk
$ awk -F'[/"]' '{print $(NF-1)}' file
FCCY.png
FLWS14UU.png
STSMLC.png
or use the shell
while read -r line
do
line=${line##*/}
echo "${line%\"}"
done <"file"
newlist=$(for file in ${list}; do basename "${file}"; done)
$ var="/home/mossen/Desktop/jeff's project/Results/FCCY.png"
$ file="${var##*/}"
Using basename iteratively has a huge performance hit. It's small and unnoticeable when you're doing it on a file or two, but it adds up over hundreds of them. Let me do some timing tests for you to show why using basename (or any callout to a system utility) is bad when an internal feature can do the job. Dennis and ghostdog74 gave you the more experienced BASH answers.
Sample input files.txt (list of my pics with full path): 3749 entries
external.sh
while read -r line
do
line=`basename "${line}"`
echo "${line%\"}"
done < "files.txt"
internal.sh
while read -r line
do
line=${line##*/}
echo "${line%\"}"
done < "files.txt"
Timed results, redirecting output to /dev/null to get rid of any video lag:
$ time sh external.sh 1>/dev/null
real 0m4.135s
user 0m1.142s
sys 0m2.308s
$ time sh internal.sh 1>/dev/null
real 0m0.413s
user 0m0.357s
sys 0m0.021s
The output of both is identical:
$ sh external.sh | sort > result1.txt
$ sh internal.sh | sort > result2.txt
$ diff -uN result1.txt result2.txt
So, as you can see from the timing tests, you really want to avoid external calls to system utilities when you can write the same feature in some creative BASH code/lingo, especially when it's going to be called a whole lot of times over and over.
