concatenate files awk/linux - linux

I have n files in a folder, each of which starts with lines as shown below.
##contig=<ID=chr38,length=23914537>
##contig=<ID=chrX,length=123869142>
##contig=<ID=chrMT,length=16727>
##samtoolsVersion=0.1.19-44428cd
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120
chr1 412573 SNP74 A C 2040.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;DP=58;
chr1 602567 BICF2G630707977 A G 877.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 604894 BICF2G630707978 A G 2044.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 693376 . GCCCCC GCCCC 761.73 . AC=2;AC1=2;AF=1.00;AF1=1;
There are n such files. I want to concatenate all of them into a single file such that every line beginning with # is deleted from all the files and the remaining rows from all the files are concatenated, retaining only one header line. Example output is shown below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120
chr1 412573 SNP74 A C 2040.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;DP=58;
chr1 602567 BICF2G630707977 A G 877.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 604894 BICF2G630707978 A G 2044.77 PASS AC=2;AC1=2;AF=1.00;AF1=1;AN=2;DB;
chr1 693376 . GCCCCC GCCCC 761.73 . AC=2;AC1=2;AF=1.00;AF1=1;

Specifically with awk:
awk '$0!~/^#/{print $0}' file1 file2 file3 > outputfile
Broken down: you check whether the line ($0) does not match (!~) a pattern for lines beginning with # (/^#/), and if so, print the line. You take the input files and write to (>) outputfile. Note that this drops every # line, including the #CHROM header.
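If you also want to keep a single copy of the #CHROM header line, as the example output shows, a minimal variant (assuming every file carries the same #CHROM header) could be:
awk '/^#CHROM/ { if (!h) { print; h=1 } next } !/^#/' file1 file2 file3 > outputfile
The first #CHROM line seen is printed once, every other # line is skipped, and all data lines pass through.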

Your problem is not terribly well specified, but I think you are just looking for:
sed '/^##/d' $FILE_LIST > output
Where FILE_LIST is the list of input files (you may be able to use a glob such as *).
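For example, assuming the inputs are VCF-like files matched by *.vcf, this might look like:
sed '/^##/d' *.vcf > output   # *.vcf is an assumed glob; adjust to your file names
Note that this only deletes the ## metadata lines; the #CHROM header from each file is kept, so with several input files you will still get repeated header lines.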

If I understood correctly, you could do:
echo "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120" > mergedfile
for file in $FILES; do cat $file | grep -v "#" >> mergedfile; done
Note that $FILES could be generated with ls (or a shell glob), and the -v option to grep is the invert-match flag.
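A slightly more robust version of the same idea, with quoting and a shell glob instead of ls (the *.vcf pattern is an assumption about your file names), might be:
echo "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P922_120" > mergedfile
for file in *.vcf; do grep -v "^#" "$file" >> mergedfile; done   # *.vcf is an assumed glob
Anchoring the pattern with ^ drops only lines that start with #, rather than any line containing a # somewhere.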

I believe what you want is
awk '$0 ~ /^##/ { next } $0 ~ /^#/ && !printed_header { print; printed_header=1 } $0 !~ /^#/ { print }' file1 file2 file3

Or you can use grep like this:
grep -vh "^##" *
The -v means invert the match, so the command looks for all lines NOT starting with ## in all files, and -h suppresses printing the filenames.
Or, if you want to emit 1 header line at the start,
(grep -h "^#CHROM" * | head -n 1 ; grep -hv "^#" * ) > out.txt

Related

Check if a word from one file exists in another file and print the matching line

I have a file containing some specific words. I have another file containing URLs, some of which contain those words from file1.
I would like to print the URLs from file2 that match each word in file1. If a word is not found in file2, then print "no matching".
I tried Awk and grep, and also used if conditions, but did not get the expected results.
File1:
abc
Def
XYZ
File2:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
Output can be like:
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Etc..
Tried:
file=/bin/file1.txt
for i in `cat $file1`;
do
a=$i
echo "$a:" | awk '$repos.txt ~ $a {printf $?}'
done
Tried some other ways like if condition with grep and all... but no luck.
abc means it should only search for abc, not abcd.
You appear to want case-insensitive matching.
An awk solution:
$ cat <<'EOD' >file1
abc
Def
XYZ
missing
EOD
$ cat <<'EOD' >file2
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
EOD
$ awk '
# create lowercase versions
{
lc = tolower($0)
}
# loop over lines of file1
# store search strings in array
# key is search string, value will be results found
NR==FNR {
h[lc]
next
}
# loop over lines of file2
# if search string found, append line to results
{
for (s in h)
if (lc ~ s)
h[s] = h[s]"\n"$0
}
# loop over search strings and print results
# if no result, show error message
END {
for (s in h)
print s":"( h[s] ? h[s] : "\nno matching" )
}
' file1 file2
missing:
no matching
def:
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
$
Your attempt is pretty far from the mark. Probably learn the basics of the shell and Awk before you proceed.
Here is a simple implementation which avoids reading lines with for.
while IFS='' read -r word; do
echo "$word:"
grep -F "$word" File2
done <File1
If you want to match case-insensitively, use grep -iF.
The requirement to avoid substring matches is a complication. The -w option to grep nominally restricts matching to entire words, but the definition of "word" characters includes the underscore character, so you can't use that directly. A manual approximation might look like
grep -iE "(^|[^a-z])$word([^a-z]|$)" File2
but this might not work with all grep implementations.
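Putting that together with the loop above, a sketch (assuming the search words contain only letters, so they are safe to interpolate into the regular expression) would be:
while IFS='' read -r word; do
echo "$word:"
grep -iE "(^|[^a-z])$word([^a-z]|$)" File2
done <File1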
A better design is perhaps to prefix each output line with the match(es) that produced it, and only loop over the input file once.
awk 'NR==FNR { w[$0] = "(^|[^a-z])" $0 "([^a-z]|$)"; next }
{ m = ""
for (a in w) if ($0 ~ w[a]) m = m (m ? "," : "") a
if (m) print m ":" $0 }' File1 File2
In brief, we collect the search words in the array w from the first input file. When reading the second input file, we collect matches on all the search words in m; if m is non-empty, we print its value followed by the input line which matched.
Again, if you want case-insensitive matching, use tolower() where appropriate.
Demo, featuring lower-case comparisons: https://ideone.com/iTWpFn
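For reference, a sketch of that case-insensitive variant, lowercasing both the stored patterns and each line of File2 before testing (the match list m then reports the lowercased words):
awk 'NR==FNR { k = tolower($0); w[k] = "(^|[^a-z])" k "([^a-z]|$)"; next }
     { m = ""; lc = tolower($0)
       for (a in w) if (lc ~ w[a]) m = m (m ? "," : "") a
       if (m) print m ":" $0 }' File1 File2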

Finding matches in 2 files and printing the field above the match

File1:
2987571 2988014
4663633 4668876
4669084 4669827
4669873 4670130
4670212 4670604
4670604 4672469
4672502 4672621
4672723 4673088
4673102 4673518
4673521 4673895
4679698 4680174
5756724 5757680
5757937 5758506
5758855 5759202
5759940 5771528
5772524 5773063
5773005 5773106
5773063 5773452
5773486 5773776
5773836 5774189
File2:
gene complement(6864294..6865061)
/locus_tag="HCH_06747"
CDS complement(6864294..6865061)
/locus_tag="HCH_06747"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33372.1"
/translation="MIKQLVRPLFTGKGPNFSELSAKECGVGEYQLRYKLPGNTIHIG
MPDAPVPARVNLNADLFDSYGPKKLYNRTFVQMEFEKWAYKGRFLQGDSGLLSKMSLH
IDVNHAERHTEFRKGDLDSLELYLKKDLWNYYETERNIDGEQGANWEARYEFDHPDEM
RAKGYVPPDTLVLVRLPEIYERAPINGLEWLHYQIRGEGIPGPRHTFYWVYPMTDSFY
LTFSFWMTTEIGNRELKVQEMYEDAKRIMSMVELRKE"
gene complement(6865197..6865964)
/locus_tag="HCH_06748"
CDS complement(6865197..6865964)
/locus_tag="HCH_06748"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="ABC33373.1"
/translation="MIKQIVRPLFTGKGPNFSELNVKECGIGDYLLRYKLPGNTIDIG
MPDAPVPSRVNLNADLFDSYDPKKLYNRTFVQMEFEWWAYRGLFLQGDSGLLSKMSLH
IDVNRINPNSPLGGSDLESLETYLREDYWDYYEAEKNIDGVPGSNWQKRYDFDNPDEV
RAKGYIPVRRLVLVLLPEIYVKERINDVEWLHYSIDGEGIAGTNITYYWAYPLTNNYY
LTFSFRTTTELGRNEQRYQRMLEDAKQIMSMVELCKG"
gene complement(6865961..6867109)
/locus_tag="HCH_06749"
CDS complement(6865961..6867109)
The goal here is to take each number from the 1st file's 1st column and see whether that number appears in the second file. If it does, I want to print the line right above the match in file2: the "/locus_tag" line.
For example, if file1 contains 6864294 and this number is also present in file2, then I'd like to print: /locus_tag="HCH_06747"
Here's a rough sample:
awk '
NR==FNR { # hash file 1 to a
a[$1]
next
}
{
q=$0
while(match($0,/[0-9]+/)) { # find all numeric strings
if((substr($0,RSTART,RLENGTH) in a)) # test if it is in a
print p # and output previous record p
$0=substr($0,RSTART+RLENGTH) # remove match from record
}
p=q # store current record to p
}' file1 file2
/locus_tag="HCH_06747"
I tried this and I think it will work:
for i in $(cat file1 | awk '{print $1 "\n" $2}')
do
grep -m1 -A1 "$i" file2 | tail -1
done
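A single-pass alternative is also possible; this is only a sketch, assuming GNU grep and bash for the process substitution: extract every number from file1, use them as fixed-string word patterns, print the line before each match, and keep only the /locus_tag lines.
grep -B1 -wFf <(grep -oE '[0-9]+' file1) file2 | grep '/locus_tag'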

find matching patterns in files linux

I am trying to find matching strings between 2 files.
for example:
file 1:
A2M,0.00351888
A2M-AS1,0.00131091
A3GALT2,0.00966505
A4GALT,0.108364
AACS,0.0830823
AACSP1,0.00264056
AADACL2-AS1,0.0318584
AADACL4,0.00384096
AAED1,0.216966
file 2:
chr1 33772366 33786699 A3GALT2 1 -
chr22 43088126 43116876 A4GALT 1 -
chr12 125549924 125627871 AACS 1 +
chr5 178191863 178203277 AACSP1 1 -
chr1 12704565 12727097 AADACL4 1 +
chr9 99403532 99417599 AAED1 1 -
chr8 117950463 117956239 AARD 1 +
chr7 121713597 121784344 AASS 1 -
chr7 48211056 48687091 ABCA13 1 +
chr1 94458393 94586705 ABCA4 1 -
chr17 66970772 67057136 ABCA9 1 -
I want to extract the lines in file2 whose 4th column is equal to the first column in file1.
I wrote this command for it:
cat file1 | cut -d ',' -f1 | grep -wFf - file2 > match_file
But when a name has another character outside [a-z], like APCDD1L-AS1,
it matches only the APCDD1L part and gives incorrect results.
I read that grep -w works only with "real" words, so I guess this is the problem.
How can I fix it? (find the whole matching string)
Using awk:
$ awk 'NR==FNR{a[$1];next}($4 in a)' FS="," file1 FS=" +" file2
chr1 33772366 33786699 A3GALT2 1 -
chr22 43088126 43116876 A4GALT 1 -
chr12 125549924 125627871 AACS 1 +
chr5 178191863 178203277 AACSP1 1 -
chr1 12704565 12727097 AADACL4 1 +
chr9 99403532 99417599 AAED1 1 -
I assumed that file2 is space separated, hence FS=" +". If it is in fact tab separated, set FS="\t" instead.
There is nothing in your data samples implying I cannot simply grep the whole line, as only one of the columns contains alphanumeric names in that format. If that's the case, this will do (Bash compatible):
#!/bin/bash
rm -f matched_output.txt
patterns=$( awk -F',' '{ print $1 }' Matching_patterns.txt )
while read -r pattern
do
printf 'Attempting %s' "$pattern"
grep -F "$pattern" mytext.txt >> matched_output.txt && printf " - Success!\n" || printf " - Failed\n"
done <<< "$patterns"
(The original answer attached screenshots of the input files, the script running, and the output file.)
Hope this is useful for you! Regards!
You can try this; it lets you avoid trouble with special symbols in the names:
firsts=( $(cut -d',' -f1 f1) ); for lines in "${firsts[@]}"; do grep "${lines}" f2 >> output; done

Looping through a table and append information of that table to another file

This is my first post and I'm fairly new to bash coding.
We ran some experiments where I work, and for plotting in gnuplot we need to append a reaction label to each result.
We have a file that looks like this:
G135b CH2O+HCO=O2+C2H3
R020b 2CO+H=OH+C2O
R021b 2CO+O=O2+C2O
and a result file (which I can't access right now, sorry) whose first column is the same as in the file shown, followed by multiple values. The lines are not in the same order.
Now I want to loop through the result file, take the value of the first column, search for it in the file shown, and append the reaction label to that line.
How can I loop through all the lines of the result file and store the value of the first column in a temporary variable?
I want to use this variable like this:
grep "^$var" shownfile | awk '{print $2}'
(This gives back something like: CH2O+HCO=O2+C2H3)
How can I append the result of that line to the result file?
Edit: I also wrote a script to go from a file that looks like this:
G135b : 0.178273 C H 2 O + H C O = O 2 + C 2 H 3
to this:
G135b CH2O+HCO=O2+C2H3
which is:
#!/bin/bash
file=$(pwd)
cd $file
# echo "$file"
cut -f1,3 $file/newfile >>tmpfile
sed -i "s/://g" tmpfile
sed -i "s/ //g" tmpfile
cp tmpfile newfile
How do I make cut edit a file in place, like -i for sed? My workaround is pretty ugly because it creates another file in the current directory.
Thank you :)
The join command works here; it performs an inner join of the 2 files on the 1st column of each (by default). Note that join expects both files to be sorted on the join field.
$ cat data
G135b CH2O+HCO=O2+C2H3
R020b 2CO+H=OH+C2O
R021b 2CO+O=O2+C2O
$ cat result_file
G135b a b c
R020b a b
R021b a b x y z
$ join data result_file
G135b CH2O+HCO=O2+C2H3 a b c
R020b 2CO+H=OH+C2O a b
R021b 2CO+O=O2+C2O a b x y z
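If the files are not already sorted on the first column, a sketch that sorts them on the fly with process substitution (bash) would be:
join <(sort data) <(sort result_file) > labelled_results   # labelled_results is an example output name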
Using awk, it would be something like:
NR == FNR { data[$1] = $2; next; }
{ print $0 " " data[$1]; }
Save that in a file called reactions.awk, then call awk -f reactions.awk shownfile resultfile.
awk '{a[$1]=a[$1]$2} END{for (i in a){print i,a[i]}}' file1 file2

Searching for text

I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*) and their RPKM1 and RPKM2 values in a gtf file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and print/write them to a separate output file:
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line for each ID using:
grep -w "ENSGno" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read, do and changing the variables but without success. Any ideas would be great!
You can do something like:
while read -r line
do
var=$(echo "$line" | awk '{print $2}')
grep -w "$var" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
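An alternative that reads the gtf only once is sketched below; it assumes the IDs are in the second column of geneIDs.file and relies on grep -w treating the dot in versioned IDs like ENSG00000174177.7 as a word boundary:
awk '{print $2}' geneIDs.file | grep -wFf - rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' > output.file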

Resources