shell script to process data from two input files - linux

I have a lot of fastq files for ~100 samples (two per sample: reads1 and reads2). For each sample, I need to input the two fastq files into Prinseq, a perl program. Ideally I'd do this with a shell script for all samples so I don't have to call the program 100 times manually, but I only know how to loop over a single input file (i.e., for i in *.fastq; do [perl commands]; done), not how to indicate two input files per iteration. If it helps, the Prinseq command format is as follows:
perl prinseq-lite.pl -fastq [file for reads1] -fastq2 [file for
reads2] -derep [options]
This is probably a very easy answer, but I can't find it.

You can loop over all R1 files and use parameter expansion to chop off the R1_001.fastq part (and replace it by the R2 version):
for i in *_R1_001.fastq; do
perl prinseq-lite.pl -fastq "$i" -fastq2 "${i%R1_001.fastq}R2_001.fastq"
done
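If some pairs might be missing, here is a slightly more defensive sketch of the same idea (append your -derep and other Prinseq options to the perl line) that skips any sample whose R2 file is absent:
for r1 in *_R1_001.fastq; do
    r2=${r1%R1_001.fastq}R2_001.fastq    # swap R1 for R2 in the name
    if [ ! -f "$r2" ]; then
        echo "skipping $r1: no matching $r2" >&2
        continue
    fi
    perl prinseq-lite.pl -fastq "$r1" -fastq2 "$r2"    # add -derep [options] here
done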

Related

How to split large file to small files with prefix in Linux/Bash

I have a file in Linux called test. Now I want to split test into, say, 10 small files.
The test file has more than 1000 table names. I want the small files to have an equal number of lines; the last file may or may not end up with the same number of table names.
What I want to know is: can we add a prefix to the split files while invoking the split command in the Linux terminal?
Sample:
test_xaa test_xab test_xac and so on..............
Is this possible in Linux?
I was able to solve my question with the following statement
split -l $(($(wc -l < test.txt )/10 + 1)) test.txt test_x
With this I was able to get the desired result
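To illustrate what that does (assuming test.txt has exactly 1000 lines, so the arithmetic gives 1000/10 + 1 = 101 lines per piece):
split -l 101 test.txt test_x
# creates test_xaa .. test_xaj: nine files of 101 lines and a last one of 91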
I would've sworn split did this on its own, and in fact it does: the optional second operand is the output prefix, which is exactly what your split ... test_x command uses. If you have files that were already split without a prefix and just need to rename them, try something like this:
for x in /path/to/your/x*; do
    mv "$x" "$(dirname "$x")/your_prefix_$(basename "$x")"
done

Generate file names for proper sequential sorting under shell globbing

I am generating a sequence of PNG images in my program, which I plan to pass through some tool that converts them to a video file. I generate the files one by one, in the sequence I want them. I want to name them so that the video conversion tool will take them in the proper sequence under the file name globbing used by the shell (I am using bash on Linux). I tried a numeric suffix like scene1.png, scene2.png, ..., scene10.png, but the shell doesn't sort globs numerically, so scene10.png sorts before scene2.png. I could pass a sorted list like this:
convert -antialias -delay 1x10 $(ls povs/*.png | sort -V) mymovie.mp4
But some programs do their own globbing instead of using the shell's (FFmpeg, for example), so this approach does not always work. So I am looking for a scheme of naming the files that guarantees they are in sequence under shell globbing rules.
You can prefix your files with a zero-padded integer.
This loop emulates what ls * would output after such a renaming:
for i in {0..11}; do
    printf '%05d_%s\n' "$i" "file${i}"
done
00000_file0
00001_file1
00002_file2
00003_file3
00004_file4
00005_file5
00006_file6
00007_file7
00008_file8
00009_file9
00010_file10
00011_file11
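To actually rename existing files into that scheme, something like the following sketch should work (it assumes names of the form scene1.png, scene2.png, ... as in the question):
for f in scene*.png; do
    n=${f#scene}     # strip the "scene" prefix
    n=${n%.png}      # strip the ".png" suffix, leaving the number
    mv -- "$f" "scene$(printf '%05d' "$n").png"
done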

how to use do loop to read several files with similar names in shell script

I have several files named scale1.dat, scale2.dat, scale3.dat ... up to scale9.dat.
I want to read these files in a do loop one by one and do some manipulation with each file (I want to write the 1st column of each scale*.dat file to the corresponding scale*.txt).
So my question is, is there a way to read files with similar names? Thanks.
The regular syntax for this is
for file in scale*.dat; do
awk '{print $1}' "$file" >"${file%.dat}.txt"
done
The asterisk * matches any text or no text; if you want to constrain to just single non-zero digits, you could say for file in scale[1-9].dat instead.
In Bash, there is also the non-standard brace expansion syntax scale{1..9}.dat (not actually a glob: it expands whether or not the files exist), but this is Bash-only, and so will not work in #!/bin/sh scripts. (Your question has both sh and bash so it's not clear which you require. Your comment that the Bash syntax is not working for you suggests that you may need a POSIX portable solution.) Furthermore, Bash has something called extended globbing, which allows for quite elaborate pattern matching. See also http://mywiki.wooledge.org/glob
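For example, with extended globbing enabled you could require the part after scale to be one or more digits (a Bash-only sketch):
shopt -s extglob                  # enable extended globbing
for file in scale+([0-9]).dat; do
    awk '{print $1}' "$file" >"${file%.dat}.txt"
done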
For a simple task like this, you don't really need the shell at all, though.
awk 'FNR==1 { if (f) close (f); f=FILENAME; sub(/\.dat/, ".txt", f); }
{ print $1 >f }' scale[1-9]*.dat
(Okay, maybe that's slightly intimidating for a first-timer. But the basic point is that you will often find that the commands you want to use will happily work on multiple files, and so you don't need shell loops at all in those cases.)
I don't think so. Similar names or not, you will have to iterate through all your files (perhaps with a for loop) and use a nested loop to iterate through lines or words or whatever you plan to read from those files.
Alternatively, you can copy your files into one (say, scale-all.dat) and read that single file.

How does linux redirect IO work internally

When we use the IO redirect operator for a shell script, does the operator keep all the data to be written in memory and write it all at once, or does it write to the file line by line?
Here is what I am working on.
I have about 200 small files ~1000 lines each in a specific format. I want to process (do a regex and change the format a little) each line in all the files and have the new transformed lines in a single combined file.
I have a transformscript.sh that takes a single file and applies the transformation. I run it in the following manner
sh transformscript.sh somefile.txt > newfile.txt
This works fine and fast for a single file.
How do I extend this to all the files? Will it be efficient to change transformscript.sh to take a directory as an argument instead of a filename and add a for loop to transform all the lines of all the files together? Or should I run the above transformscript.sh for each file, create a new file for each one, and combine them separately?
Thanks.
The redirect operator simply opens the file for writing and installs that file descriptor as the command's standard output. The command then writes to the file directly as it produces output; the shell does not buffer everything in memory and write it at the end.
You probably do NOT want to run the script separately for each file since you will incur the overhead of bash process creation for each pass. For example:
# don't do it this way
for somefile in somefiles*.txt; do
    newfile=${somefile//some/new}
    sh transformscript.sh "$somefile" > "$newfile"
done
The above starts one shell for every file found, which is pretty inefficient. It would be better to rewrite transformscript.sh to handle multiple files if possible. Depending on how complicated your transform is and whether you need to keep the original filenames, you might be able to use a single sed process. For example, assume you have 200 files named test1.txt through test200.txt, all with a "Hello world" line you want to change to "Hello joe". You could do something as simple as this:
sed -i.save 's/Hello world/Hello joe/' test*.txt
The -i tells sed to do an "in place" edit (edit the original file) and the optional ".save" argument to -i makes a backup copy of the original file with a .save extension before editing the original file. Note, this will leave the original contents in the .save files and the new content in the files with the original name which may not be what you want.
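Since you ultimately want one combined file, another option is a single process and a single redirection. A sketch (combined.txt is just an example name, and the sed expression stands in for your actual transformation):
# one sed process reads all the input files; one redirection writes the combined result
sed 's/Hello world/Hello joe/' somefiles*.txt > combined.txt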

Comparing Two Files For Matching Words in Linux

Let's say we have two files as follows:
File A.txt
Karthick is not so intelligent
He is not lazy
File B.txt
karthick is not so bad either
He is hard worker
So in the two files above, the common words are "karthick is not so" and "He is" in each of the lines. Is there any way to print all such common words with the grep command or some other Linux command?
You want to use the dwdiff utility :).
Example usage:
dwdiff "File A.txt" "File B.txt"
It might take a little while to get used to its output, but check http://linux.die.net/man/1/dwdiff for more details on that.
There are also several visual diff applications out there, but I prefer using it on the command line.
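If you would rather stick to standard tools, a rough sketch of the same idea is to reduce each file to a sorted word list and let comm print the words they share (case-folded here because the example has Karthick vs karthick; a.words and b.words are just temporary file names):
tr -s '[:space:]' '\n' < "File A.txt" | tr '[:upper:]' '[:lower:]' | sort -u > a.words
tr -s '[:space:]' '\n' < "File B.txt" | tr '[:upper:]' '[:lower:]' | sort -u > b.words
comm -12 a.words b.words          # print only the words present in both files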