Searching for text - linux

I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing a list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*) and their RPKM1 and RPKM2 values in a GTF file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and write them to a separate output file like this:
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line, one ID at a time, using:
grep -w "ENSGno" rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read, do and changing the variables but without success. Any ideas would be great!

You can do something like:
while read -r name id
do
    # $id is the ENSG ID from column 2; grep -w matches it as a whole word
    grep -w "$id" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
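The loop above starts one grep per gene and prints the raw fields, quotes and semicolons included. A single awk pass can do the lookup and the cleanup together. What follows is a minimal sketch, not a drop-in solution: it assumes the ENSG ID sits in column 2 of the ID list and that the attribute fields are laid out exactly as in the sample GTF lines in the question (ID in field 10, RPKM values in fields 14 and 16). The sample rows, coordinates and RPKM numbers written to the scratch directory are made up for illustration.

```shell
#!/bin/sh
# Build tiny sample inputs in a scratch directory (illustrative data only).
tmp=$(mktemp -d)
cat > "$tmp/geneIDs.file" <<'EOF'
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
EOF
cat > "$tmp/rnaseq.gtf" <<'EOF'
chr12 gencodeV7 gene 100 200 0.1 + . gene_id "ENSG00000175197.5"; transcript_ids "ENST1.1,"; RPKM1 "7.5"; RPKM2 "8.25"; iIDR "0.003";
chr11 gencodeV7 gene 300 400 0.0 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST2.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
EOF

# First file: remember the wanted IDs.
# Second file: strip quotes, semicolon and version suffix from field 10,
# then print the ID with its cleaned RPKM1/RPKM2 values when it is wanted.
result=$(awk '
    NR == FNR { want[$2]; next }
    {
        id = $10
        gsub(/[";]/, "", id)
        sub(/\.[0-9]+$/, "", id)
        if (id in want) {
            r1 = $14; r2 = $16
            gsub(/[";]/, "", r1)
            gsub(/[";]/, "", r2)
            print id, r1, r2
        }
    }
' "$tmp/geneIDs.file" "$tmp/rnaseq.gtf")

printf 'Gene_ID RPKM1 RPKM2\n%s\n' "$result"
rm -rf "$tmp"
```

The GTF file is read once no matter how many IDs are in the list, which matters more as the list grows. IDs from the list that never appear in the GTF file are silently skipped.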

Related

How do I grep a string on multiple files only if the string is present in all of the files?

I have around 20 files. The first columns of each file contains ids (ID0001, ID0056, ID0165 etc). I have a list file that contains all possible ids. I want to find the ids from that file that are present in all the files. Is there a way to use grep for this? So far if I use the command:
grep "id_name" file*.txt
it prints the ID even if it is present in only one file.
There is a simple grep pipeline that does this, although it is cumbersome to write out (note that it prints the matching lines of the last file, not just the IDs):
cut -f1 file1 | grep -Ff - file2 | grep -Ff - file3 | grep -Ff - file4 ...
Another way is using awk:
awk '{a[$1]++}END{for(i in a) if (a[i]==ARGC-1) print i}' file1 file2 file3 ...
The latter assumes that the IDs are unique within each file.
If they are not unique, it is a bit trickier:
awk '(FNR==1){delete b}!($1 in b){a[$1]++;b[$1]}END{for(i in a) if (a[i]==ARGC-1) print i }' file1 file2 file3 ...
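To see the counting approach on concrete data, here is a small sketch using three made-up files (the file names and contents are assumptions for illustration). It follows the same idea as the non-unique variant above: count each ID at most once per file, then keep the IDs whose count equals the number of files. The per-file array is cleared with split("", seen), which works in awk implementations that do not support deleting a whole array, and the result is piped through sort because awk's for-in order is unspecified.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# ID0056 appears twice in file2 on purpose; ID0001 is missing from file2.
printf 'ID0001 a\nID0056 b\nID0165 c\n' > "$tmp/file1.txt"
printf 'ID0056 x\nID0056 y\nID0165 z\n' > "$tmp/file2.txt"
printf 'ID0165 p\nID0001 q\nID0056 r\n' > "$tmp/file3.txt"

common=$(awk '
    FNR == 1      { split("", seen) }           # new file: forget the previous file IDs
    !($1 in seen) { seen[$1]; count[$1]++ }     # count each ID once per file
    END {
        for (id in count)
            if (count[id] == ARGC - 1)          # ARGC - 1 is the number of input files
                print id
    }
' "$tmp"/file*.txt | sort)

printf '%s\n' "$common"
rm -rf "$tmp"
```

Here ID0056 and ID0165 are printed, and ID0001 is dropped because file2 does not contain it.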
Say you have a list of all the ids in a file ids_list.txt with each ID being on a single line like
id001
id101
id201
...
And all the files you want to search are in the folder data. In this scenario, this little script should be able to help you:
#!/bin/bash
# Join all IDs into one alternation, e.g. id001|id101|id201
all_ids=$(paste -sd'|' ids_list.txt)
# Search recursively (-r), case-insensitively (-i), with a Perl regex (-P)
grep -Pir "^($all_ids)[\s,]+" data
Its output would look like:
data/f1:id001, ssd
data/f3:id201, some data
...
This may be what you're trying to do but without sample input/output it's an untested guess:
awk '
    !seen[FILENAME,$1]++ {
        cnt[$1]++
    }
    END {
        for (id in cnt) {
            if ( cnt[id] == (ARGC-1) ) {
                print id
            }
        }
    }
' list file*

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
This code works well for multiline sequences, but each ID has to be passed to the command separately.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
This code can take the IDs from the id.txt file, but it returns only the first line of each multiline sequence.
I guess the right approach would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may help as well:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
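Since the question asked specifically about changing RS, here is a hedged sketch of that route. It relies on awk processing a var=value operand at the point where it appears in the argument list, so id.txt is still read line by line and only seq.fasta is read with ">" as the record separator. It assumes the default field splitting treats newlines as whitespace (true for gawk, mawk and the one-true-awk in their normal modes), so that $1 is the header text, and that ">" never occurs inside a header or sequence. The sample data is a shortened copy of the question's input.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/seq.fasta" <<'EOF'
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
EOF
printf '7P58X:01332:11636\n7P58X:01334:11613\n' > "$tmp/id.txt"

picked=$(awk '
    NR == FNR { ids[$1]; next }     # id.txt: read with the default RS, one ID per line
    NF && ($1 in ids) {             # seq.fasta: one record per sequence, $1 is its ID
        sub(/\n$/, "")              # drop the newline that ends the record
        print ">" $0                # put back the ">" consumed as the separator
    }
' "$tmp/id.txt" RS='>' "$tmp/seq.fasta")

printf '%s\n' "$picked"
rm -rf "$tmp"
```

The NF test skips the empty record that precedes the first ">". The flag-based one-liners above are simpler and do not depend on how a given awk splits multi-line records, so this variant is mainly useful if whole records are needed for further processing.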
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
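seqnm=$1
# Look the title up in the index; the match looks like idx0:idx1:>title
idx0_idx1=$(grep -n -- "$seqnm" multi-fasta.idx)
idx0=$(echo "$idx0_idx1" | cut -d ":" -f 1)    # line number within multi-fasta.idx
idx1=$(echo "$idx0_idx1" | cut -d ":" -f 2)    # line number within multi-fasta.fa
# Line number (in multi-fasta.fa) of the title that follows the wanted one.
# Note: this assumes the wanted sequence is not the last one in the file.
idx2=$(head -n "$((idx0 + 1))" multi-fasta.idx | tail -1 | cut -d ":" -f 1)
# Print only lines idx1 .. idx2-1: the title and its sequence
sed -n "${idx1},$((idx2 - 1))p" multi-fasta.fa > "${seqnm}.fasta"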
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
Finally, sed extracts the lines from idx1 to idx2 minus 1, which are the title and the sequence. (If every sequence had the same number of lines, grep -A would be enough.)
The advantage of this ugly hack is that it does not require a fixed number of lines per sequence in the multi-fasta file.
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
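For comparison, the same single-sequence extraction can be sketched without a separate index file: one awk pass that starts printing at the wanted title and exits at the next title. This is only a sketch under assumptions: the title line is taken to be exactly ">" followed by the ID with no trailing description, and the sample file below is made up. No timing claim is implied for large files; in the worst case the whole file is still scanned.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/multi-fasta.fa" <<'EOF'
>DM_0000000004
ACGT
ACGG
>DM_0000000005
TTTT
>DM_0000000007
GGGG
EOF

seqnm=DM_0000000005
one=$(awk -v id="$seqnm" '
    /^>/ {
        if (found) exit               # next title reached: stop reading the file
        found = ($0 == ">" id)        # start printing at the wanted title
    }
    found                             # print title and sequence lines while found is set
' "$tmp/multi-fasta.fa")

printf '%s\n' "$one"
rm -rf "$tmp"
```

Unlike the index-based script, this also works when the wanted sequence is the last one in the file, because reaching end of input simply ends the printing.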

How to grep particular lines

I am trying to fetch some IDs from a URL.
In my script I request the URL in a while loop with wget and save the output to a file.
Then, in the same loop, I grep for XYZ User ID: plus the 3 lines after it and save that to another file.
When I open this output file I find the following lines.
< p >XYZ User ID:< /p>
< /td >
< td>
< p>2989288174< /p>
So, using grep or anything else, how can I print the following output?
XYZ User ID:2989288174
Supposing a constant tag pattern:
<p>XYZ User ID:</p>
</td>
<td>
<p>2989288174</p>
grep should be the best way:
grep -oP '(?<=p>)([^>]+?)(?=<\/p)' outputfile | while read -r user; do
    read -r id
    echo "$user $id"
done
Note that look-behind expressions cannot be of variable length. That means you cannot use quantifiers (?, *, +, etc.) or alternation of different-length items inside them.
For tags with variable spacing, awk is well suited as long as each tag pair sits on a single line:
awk '/User ID/{print ""}/p *>/{printf "%s", $3}' FS='(p *>|<)' outputfile
This should work (sed with extended regex):
sed -nr 's#<\s*p\s*>([^>]*)<\s*/\s*p\s*>#\1#p' file | tr -d '\n'
Output:
XYZ User ID:2989288174
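If the file holds several users, the sed and tr pipeline above would run all of them together on one line. A possible variant, sketched here under the assumption that each tag fits on one line and that the ID is purely numeric, strips the tags with awk and prints one label and ID pair per line. The second ID in the sample input is made up for illustration. As with any regex-based approach, this is fragile against HTML that is laid out differently.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/outputfile" <<'EOF'
<p>XYZ User ID:</p>
</td>
<td>
<p>2989288174</p>
< p >XYZ User ID:< /p>
< /td >
< td>
< p>1234567890< /p>
EOF

pairs=$(awk '
    {
        gsub(/<[^>]*>/, "")                 # remove every tag on the line
        gsub(/^[ \t]+|[ \t]+$/, "")         # trim leading and trailing blanks
    }
    /User ID:$/ { label = $0; next }        # remember the label line
    label != "" && /^[0-9]+$/ {             # first all-digit line after a label
        print label $0
        label = ""
    }
' "$tmp/outputfile")

printf '%s\n' "$pairs"
rm -rf "$tmp"
```

Each label is paired with the first all-digit line that follows it, so the empty lines left behind by the td tags are ignored.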

concatenate files awk/linux

I have n files in a folder which starts with lines as shown below.
##contig=<ID=chr38,length=23914537>
##contig=<ID=chrX,length=123869142>
##contig=<ID=chrMT,length=16727>
##samtoolsVersion=0.1.19-44428cd
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120
chr1 412573 SNP74 A C 2040.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;DP=58;
chr1 602567 BICF2G630707977 A G 877.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 604894 BICF2G630707978 A G 2044.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 693376 . GCCCCC GCCCC 761.73 . AC=2;AC1=2;AF=1.00;AF1=1;
There are n such files. I want to concatenate them into a single file: every line beginning with # should be dropped from all the files, except that the header line (#CHROM ...) is retained once, followed by the remaining rows from all the files. Example output is shown below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120
chr1 412573 SNP74 A C 2040.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;DP=58;
chr1 602567 BICF2G630707977 A G 877.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 604894 BICF2G630707978 A G 2044.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 693376 . GCCCCC GCCCC 761.73 . AC=2;AC1=2;AF=1.00;AF1=1;
Specifically with awk:
awk '$0!~/^#/{print $0}' file1 file2 file3 > outputfile
Broken down: this checks whether the line ($0) does not match (!~) a string beginning with # (/^#/) and, if so, prints the line. It reads the input files and writes to (>) outputfile. Note that this drops the #CHROM header line as well.
Your problem is not terribly well specified, but I think you are just looking for:
sed '/^##/d' $FILE_LIST > output
Where FILE_LIST is the list of input files (you may be able to use *).
If I understood correctly, you could do:
echo "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120" > mergedfile
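for file in $FILES; do grep -v "^#" "$file" >> mergedfile; done
Note that $FILES could be the output of ls (or simply a glob), and the -v option of grep inverts the match. The pattern is anchored with ^ so that only lines starting with # are dropped.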
I believe what you want is
awk '$0 ~ /^##/ { next } $0 ~ /^#/ && !printed_header { print; printed_header=1 } $0 !~ /^#/ { print }' file1 file2 file3
Or you can use grep like this:
grep -vh "^##" *
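The -v inverts the match, so the command prints all lines NOT starting with ## in all files, without filenames (-h). This still prints the #CHROM line of every file.
Or, if you want to emit exactly one header line at the start,
(grep -h '^#CHROM' * | head -n 1; grep -hv '^#' *) > out.txt
Write out.txt somewhere outside the files matched by *, or narrow the glob, so that the output is not picked up as an input.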
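The same result can be sketched in one awk pass that needs no second scan of the files: skip the ## metadata, print the first single-# header it meets, and print every data row. The two small files below (a.vcf and b.vcf) are hypothetical names with made-up contents. The sketch assumes all files share the same header line, so keeping only the first one loses nothing; if the sample column differs between files, only the first file's name survives.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '##meta=a\n#CHROM POS ID\nchr1 10 x\n' > "$tmp/a.vcf"
printf '##meta=b\n#CHROM POS ID\nchr1 20 y\n' > "$tmp/b.vcf"

merged=$(awk '
    /^##/ { next }                          # drop metadata lines
    /^#/  { if (!header++) print; next }    # keep only the first header line
    { print }                               # keep every data row
' "$tmp/a.vcf" "$tmp/b.vcf")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Because the input files are named explicitly (or matched by a pattern such as *.vcf), the merged output can be redirected to any file without the risk of it being read back in.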

extracting data from two list using a shell script

I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry, it adds it to another file, and it loops through the first list until it has gone through the whole file. The data in the first file looks like this -
email#address.com;
email2#address.com;
and so on
The other file in which I am looking for a match and placing the match in the blank file looks like this -
12334 email#address.com;
32213 email2#address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this but only for addresses that match -
12334 email#address.com;
32213 email2#address.com;
Thank You
This is quite like manipulating text using SQL:
$ cat file1
b#address.com
a#address.com
c#address.com
d#address.com
$ cat file2
10712 e#address.com
11457 b#address.com
19985 f#address.com
22519 d#address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b#address.com
22519 d#address.com
1. Sort both inputs on the key (we use the emails as keys here).
2. Join on the keys (file1 column 1, file2 column 2).
3. Format the output (awk swaps the two columns back).
As you've learned about diff and comm, now it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates: it joins two files together, based on keys embedded in the files.
The main constraint on using join is that both files must be sorted on the column used as the key.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
You can consult the join man page for how to reduce and reorder the columns of output. For example:
join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
Read file1.txt line by line, assigning each line to the variable ADDR. Then grep file2.txt for the content of ADDR and append the output to file_result.txt.
while read -r ADDR; do grep "${ADDR}" file2.txt >> file_result.txt; done < file1.txt
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk's built-in variables that store line numbers. NR keeps counting across all input files, while FNR restarts at 1 for each new file, so NR==FNR is true only while the first file is being read. While that condition is true, we add every first-column value to the array a. Once the first file is done, we check the second column of the second file: if it is present in the array, we write the entire line to f3.txt; if not, we ignore the line.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b#address.com
a#address.com
c#address.com
d#address.com
[jaypal:~/Temp] cat f2.txt
10712 e#address.com
11457 b#address.com
19985 f#address.com
22519 d#address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b#address.com
22519 d#address.com
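To make the NR==FNR idiom from this answer visible, the short sketch below prints both counters for every line of two small files (shortened copies of the sample data above, written to a scratch directory). It only illustrates how the counters behave; it is not part of the answer's solution.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'b#address.com\nd#address.com\n' > "$tmp/f1.txt"
printf '10712 e#address.com\n11457 b#address.com\n' > "$tmp/f2.txt"

# Print NR, FNR and which branch of the NR==FNR test each line would take.
trace=$(awk '{ print NR, FNR, (NR == FNR ? "first" : "later") }' "$tmp/f1.txt" "$tmp/f2.txt")

printf '%s\n' "$trace"
rm -rf "$tmp"
```

The two counters agree only for the lines of f1.txt. One caveat worth knowing: if the first file is empty, NR==FNR stays true for the second file as well, so the idiom would treat the second file as if it were the first.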
