The code below searches for a set of patterns (contained in the $snplist variable) within multiple files (the $file variable, for files ending in snp_search.txt) and outputs a long list of whether or not each SNP is in each file.
The purpose is to find several SNPs that are in all of the files.
Is there a way to embed the code below in a while loop so that it keeps running until it finds a SNP that is in all of the files and breaks when it does? Otherwise I have to check the log file manually.
for snp in $snplist; do
    for file in *snp_search.txt; do
        if grep -wq "$snp" $file; then
            echo "${snp} was found in $file" >> ${date}_snp_search.log
        else
            echo "${snp} was NOT found in $file" >> ${date}_snp_search.log
        fi
    done
done
You can use grep to search all the files. If the file names don't contain newlines, you can just count the number of matching files directly:
#! /bin/bash
files=(*snp_search.txt)
count_files=${#files[@]}
for snp in $snplist ; do
    count=$(grep -wl "$snp" *snp_search.txt | wc -l)
    if ((count == count_files)) ; then
        break    # $snp is present in every file
    fi
done
For file names containing newlines, you can output the first matching line for each $snp without the file name and count the lines:
count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)
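Putting those pieces together, the loop the question asks for could look like this sketch (the $snplist and ${date} variables are assumed to be set as in the question):
#! /bin/bash
files=(*snp_search.txt)
count_files=${#files[@]}
for snp in $snplist ; do
    # count how many files contain a word match for this SNP (at most one matching line per file)
    count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)
    if ((count == count_files)) ; then
        echo "${snp} was found in all ${count_files} files" >> "${date}_snp_search.log"
        break
    fi
done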
Assumptions:
multiple SNPs may exist in a single line of an input file
will print a list of all SNPs that exist in all files (OP has mentioned contradictory statements: find several SNPs that are in all of the files vs. break when one SNP is found in all files)
Sample inputs (will update if OP updates question with sample data):
$ cat snp.dat
ABC
DEF
XYZZ
$ cat 1.snp.search.txt
ABCD-XABC
someABC_stuff
ABC-
de-ABC-
de-ABC
DEFG
zDEFG
.DEF-xyz
abc-DEF
abc-DEF-ABC-xyz
$ cat 2.snp.search.txt
ABC
One GNU awk idea that requires a single pass through each input file:
awk '
FNR==NR { snps[$1]=0; next } # load 1st file into array; initialize counter (of files containing this snp) to 0
FNR==1 { filecount++ # 1st line of 2nd-nth files: increment counter of number of files
delete to_find # delete our to_find[] array
for (snp in snps) # make a copy of our master snps[] array ...
to_find[snp] # storing copy in to_find[] array
}
{ for (snp in to_find) { # loop through list of snps
if ($0 ~ "\\y" snp "\\y") { # if current line contains a "word" match on the current snp ...
snps[snp]++ # increment our snp counter (ie, number of files containing this snp)
delete to_find[snp] # no longer need to search current file for this particular snp
# break # if line can only contain 1 snp then uncomment this line
}
}
for (snp in to_find) # if we still have an snp to find then ...
next # skip to next line else ...
nextfile # skip to next file
}
END { PROCINFO["sorted_in"]="@ind_str_asc"
for (snp in snps)
if (snps[snp] == filecount)
printf "The SNP %s was found in all files\n", snp
}
' snp.dat *.snp.search.txt
NOTES:
GNU awk is required for the PROCINFO["sorted_in"]="@ind_str_asc" option to sort the snps[] array indices; if GNU awk is not available, or the ordering of the output messages is not important, then this statement can be removed from the code
since we only process each input file once, we will print all SNPs that show up in all files (ie, we won't know whether a SNP exists in all files until we've processed the last file, so we might as well print all SNPs that exist in all files)
should be faster than approaches that require multiple scans of each input file (especially for larger files and/or a large number of SNPs)
This generates:
The SNP ABC was found in all files
I am working on a project that requires me to take some .bed files as input, extract one column from each file, keep only the values above a certain threshold, and count how many of them there are for each file. I am extremely inexperienced with bash so I don't know most of the commands, but this line of code should do the trick.
for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls
I saved those values in a .xls since I need to do some graphs with them.
Now I would like to take the filenames (as listed by ls) and save them in the first column of my .xls, while my counts should be in the 2nd column of my Excel file.
I managed to save everything in one column with the command:
ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls
My sample files are:A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed and are contained in a folder with those files only.
The output is the list of all filenames and the values in the same column, like this:
Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286
but what I need should be something like this:
Filenames    #BS
A549.bed     4536
GM12878.bed  8846
H1.bed       6754
HeLa-S3.bed  14880
HepG2.bed    25440
Ishikawa.bed 14905
K562.bed     22721
MCF-7.bed    8760
SK-N-SH.bed  28286
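For what it's worth, the loop from the question only needs a small change to get there: print the filename and its count on the same line instead of running ls separately. A sketch (the output path is the one used above):
for FILE in *.bed; do
    # one row per file: "filename<TAB>count of rows with column 9 > 1.3"
    printf '%s\t%s\n' "$FILE" "$(awk '$9>1.3' "$FILE" | wc -l)"
done > /home/parallels/Desktop/EP_Cell_Type.xls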
Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.
One awk solution that keeps track of the number of matching rows and then prints the filename and line count:
awk '
FNR==1 { if ( count >= 1 ) # first line of new file? if line counter > 0
printf "%s\t%d\n", prevFN, count # then print previous FILENAME + tab + line count
count=0 # then reset our line counter
prevFN=FILENAME # and save the current FILENAME for later printing
}
$9>1.3 { count++ } # if field #9 > 1.3 then increment line counter
END { if ( count >= 1 ) # flush last FILENAME/line counter to stdout
printf "%s\t%d\n", prevFN, count
}
' * # * ==> pass all files as input to awk
For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:
bigfile.txt 7
blocker_tree.sql 4
git.bash 2
hist.bash 4
host.bash 2
lines.awk 2
local.sh 3
multi_file.awk 2
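If the header line from the desired output is also wanted, a BEGIN block can print it before any data rows. A condensed sketch of the same script with that addition (the column names are the ones from the question):
awk '
BEGIN  { printf "Filenames\t#BS\n" }                                       # header row
FNR==1 { if (count >= 1) printf "%s\t%d\n", prevFN, count; count=0; prevFN=FILENAME }
$9>1.3 { count++ }
END    { if (count >= 1) printf "%s\t%d\n", prevFN, count }
' *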
I'm new to bash and need help to copy Row 2 onwards from one file into a specific position (150 characters in) in another file. From looking through the forum, I've found a way to insert specific literal text at this position:
sed -i 's/^\(.\{150\}\)/\1specifictextlisted/' destinationfile.txt
However, I can't seem to find a way to copy content from one file into this.
Basically, I'm working with these 2 starting files and need the following output:
File 1 contents:
Sequence
AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT
File 2 contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccggccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Desired output contents:
chr2
tccccagcccagccccggccccatccccagcccagcctatccccagcccagcctatccccagcccagccccggccccagccccagccccggccccagccccagccccggccccagccccggccccatccccggccccggccccatccccgAAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTTgccccggccccggccccggccccggccccatccccagcccagccccagccccatccccagcccagccccggcccagccccagcccagccccagccacagcccagccccggccccagccccggcccaggcccagcccca
Can anybody put me on the right track to achieving this?
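Another option that stays close to the sed attempt above: read the sequence into a shell variable first, then substitute it in. A sketch, assuming File 1 is saved as file1.txt with the sequence on its second line, and reusing destinationfile.txt from the attempt above:
seq=$(sed -n '2p' file1.txt | tr -d '\n')                  # line 2 of File 1, newline stripped
sed -i "s/^\(.\{150\}\)/\1${seq}/" destinationfile.txt     # insert it after character 150
Only lines with at least 150 characters are changed, so the short header line is left untouched.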
If the file is really huge instead of just 327 characters you might want to use dd:
dd if=chr2 bs=1 count=150 status=none of=destinationfile.txt
tr -d '\n' < Sequence >> destinationfile.txt
dd if=chr2 bs=1 skip=150 seek=189 status=none of=destinationfile.txt
189 is 150+length of Sequence.
You can use awk for that:
awk 'NR==FNR{a=$2;next}{print $1, substr($2, 1, 150) "" a "" substr($2, 151)}' file1 file2
Explanation:
# Total row number == row number in file
# This is only true when processing file1
NR==FNR {
a=$2 # store column 2 in a variable 'a'
next # do not process the block below
}
# Because of the 'next' statement above, this
# block gets only executed for file2
{
# put 'a' in the middle of the second column and print it
print $1, substr($2, 1, 150) "" a "" substr($2, 151)
}
I assume that both files contain only a single line, like in your example.
Edit: In comments you said that the files actually span two lines; in that case you can use the following awk script:
# usage: awk -f this_file.awk file1 file2
# True for the second line in each file
FNR==2 {
# Total line number equals line number in file
# This is only true while we are processing file1
if(NR==FNR) {
insert=$0 # Store the string to be inserted in a variable
} else {
# Insert the string in file1
# Assigning to $0 will modify the current line
$0 = substr($0, 1, 150) "" insert "" substr($0, 151)
}
}
# Print lines of file2 (line 2 has been modified above)
NR!=FNR
You can use bash and read one char at a time from the file:
i=0
while read -n 1 -r; do
echo -n "$REPLY"
let i++
if [ $i -eq 150 ]; then
echo -n "AAAAAAAAAGGGGGGGGGGGCCCCCCCCCTTTTTTTTT"
fi
done < chr2 > destinationfile.txt
This simply reads a char, echoes it and increments the counter. If the counter is 150 it echoes your sequence. You can replace the echo with cat file | tr -d '\n'. Just make sure to remove any newlines, like here with tr. That is also why I use echo -n, so it doesn't add any.
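Concretely, that replacement could look like this (a sketch; Sequence is assumed to be a file holding only the insert, as in the dd answer above):
i=0
while read -n 1 -r; do
    echo -n "$REPLY"
    let i++
    if [ $i -eq 150 ]; then
        tr -d '\n' < Sequence    # the insert, taken from a file instead of a literal string
    fi
done < chr2 > destinationfile.txt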
I have a file named "compare" and a file named "final_contigs_c10K.fa"
I want to eliminate lines AND THE NEXT LINE from "final_contigs_c10K.fa" containing specific strings listed in "compare".
compare looks like this :
k119_1
k119_3
...
and the number of lines of compare is 26364.
final_contigs_c10K.fa looks like :
>k119_1
AAAACCCCC
>k119_2
CCCCC
>k119_3
AAAAAAAA
...
I want to make make final_contigs_c10K.fa into a format :
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
...
I tried this code, and it seems to work fine, but it takes too much time. I think that is because compare has 26364 lines, far more than in the other files I had tested the code on.
while read line; do sed -i -e "/$line/ { N; d; }" final_contigs_c10K.fa; done < compare
Is there a way to make this command faster?
Using awk
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
This will produce the output to stdout, i.e. it won't make any changes to the original files.
Explained:
$ awk '
NR==FNR { # process the first file
a[">" $1] # hash to a, adding > while at it
next # process the next record
}                   # process the second file after this point
$1 in a { p=3 } # if current record was in compare file set p
--p>0 # print current file match and the next record
' compare final_contigs_c10K.fa # mind the file order
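If the output looks right and the original file should be updated, the usual idiom is to write to a temporary file and then move it into place (a sketch, using the one-liner above):
awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa > tmp && mv tmp final_contigs_c10K.fa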
I have two data files. One has 1600 rows and the other has 2 million rows (tab-delimited files). I need to do a vlookup between these two files. Please see the example below for the expected output and kindly let me know if it's possible. I've tried using awk, but couldn't get the expected result.
File 1(small file)
BC1 10 100
BC2 20 200
BC3 30 300
File 2(large file)
BC1 XYZ
BC2 ABC
BC3 DEF
Expected Output:
BC1 10 100 XYZ
BC2 20 200 ABC
BC3 30 300 DEF
I also tried the join command. It is taking forever to complete. Please help me find a solution. Thanks
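For reference, a common awk way to express this kind of lookup is sketched below (an assumption about the intended approach, not the OP's exact attempt); it loads the small file into memory and streams the large one, and assumes tab-delimited input with the key in column 1:
awk 'BEGIN { FS=OFS="\t" }
     NR==FNR     { small[$1]=$0; next }     # file 1 (small): remember the whole line per key
     $1 in small { print small[$1], $2 }    # file 2 (large): append its 2nd column
' file1 file2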
Commands for your output:
awk '{print $1}' *file | sort | uniq -d > out.txt
for i in $(cat out.txt)
do
grep "$i" large_file >> temp.txt
done
sort -g -k 1 temp.txt > out1.txt
sort -g -k 1 out.txt > out2.txt
paste out1.txt out2.txt | awk '{print $1, $2, $3, $5}'
Commands for Vlookup
Store 1st and 2nd column in file1 file2 respectively
cat file1 file2 | sort | uniq -d ### for records which are present in both files
cat file1 file2 | sort | uniq -u ### for records which are unique and not present in bulk file
This awk script will scan each file line by line and try to match the number in the BC column. Once matched, it will print all the columns.
If one of the files does not contain one of the numbers, that number is skipped in both files and the search moves on to the next one. It loops until one of the files ends.
The script also accepts any number of columns per file and any number of files, as long as the first column is "BC" followed by a number.
This awk script assumes that the files are sorted in ascending order of the number in the BC column (as in your example). Otherwise it will not work.
To execute the script, run this command:
awk -f vlookup.awk smallfile bigfile
The vlookup.awk file will have this content:
BEGIN {files=1;lines=0;maxlines=0;filelines[1]=0;
#Column number of the BC field in each file
col_bc=1;
#Initialize variables
bc_now=0;
new_bc=0;
end_of_process=0;
aux="";
text_result="";
}
{
if(FILENAME!=ARGV[1])exit;
no_bc=0;
new_bc=0;
#Save number of columns
NFields[1]=NF;
#Copy reference file data
for(j=0;j<=NF;j++)
{
file[1,j]=$j;
}
#Read lines from file
for(i=2;i<ARGC;i++)
{
ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
}
#Check that all files are in the same number
for(i=1;i<ARGC;i++)
{
sub("BC","",file[i,col_bc]);   #strip the "BC" prefix, leaving just the number
bc[i]=file[i,col_bc];
if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;}
}
#One or more files have a new number
if (new_bc==1)
{
for(i=1;i<ARGC;i++)
{
while(bc_now!=file[i,col_bc])
{
#Read next line from file
if(i==1) ret=getline; #File 1 is the reference file
else ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
#Check if in current file data has gone to next number
if(file[i,col_bc]>bc_now)
{
no_bc=1;
break;
}
#No more data lines to compare, end of comparison
if(FILENAME!=ARGV[1])
{
exit;
}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Exit for loop)
if (no_bc==1) {break;}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Continue while loop)
if (no_bc==1) {next;}
}
#Number is aligned
for(i=1;i<ARGC;i++)
{
for(j=2;j<=NFields[i];j++) {
#Join colums in text_result variable
aux=sprintf("%s %s",text_result,file[i,j]);
text_result=sprintf("%s",aux);
}
}
printf("BC%d%s\n",bc_now,text_result)
#Reset text variables
aux="";
text_result="";
}
I also tried the join command. It is taking forever to complete.
Please help me find a solution.
It's improbable that you'll find a solution (scripted or not) that's faster than the compiled join command. If you can't wait for join to complete, you need more powerful hardware.
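If join is the route taken, the main requirement is that both files are sorted on the join field first. A sketch, assuming tab-delimited files keyed on column 1:
sort -t $'\t' -k1,1 file1 > file1.sorted
sort -t $'\t' -k1,1 file2 > file2.sorted
join -t $'\t' -1 1 -2 1 file1.sorted file2.sorted    # -> BC1  10  100  XYZ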
I am searching for a compact/elegant solution to this problem in the Linux shell (ksh if possible).
Given 2 files, both containing lines with a constant structure, eg:
file A
354guitar..06
948banjo...05
123ukulele.04
file B
354bass....04
948banjo...04
I would like to loop somehow over file A and search for lines in file B that have the same content in positions 4-11, but different content in positions 12-13.
For the case above I would expect the second line of file B as output, since "banjo..." matches the second line of file A and 05 != 04.
I was thinking of using awk, but can't find a solution by myself :(
Thanks!
Really simple with awk:
$ awk '{a=substr($0,4,8);b=substr($0,12,2)}NR==FNR{c[a]=b;next}a in c&&c[a]!=b' fileA fileB
948banjo...04
Or, in a more readable format, you can save the following in a script named file.awk:
#!/bin/awk -f
{ # This is executed for every input line (both files)
a=substr($0,4,8) # put characters 4 through 11 to variable a
b=substr($0,12,2) # put characters 12 and 13 to variable b
}
NR==FNR{ # This is executed only for the first file
c[a]=b # store into map c index a, value b
next # Go to the next record (remaining commands ignored)
}
# The remaining is only executed for the second file (due to the next command)
(a in c) && (c[a] != b) # if a is an index of the map c, and the value
# we previously stored is not the same as the current b value
# then print the current line (this is the default action)
and execute like:
awk -f file.awk fileA fileB
You could use a zsh one-liner such as this one:
for line in `cat fileA`; do grep "^[0-9]\{3\}${line[4,11]}" fileB | grep -v "${line[12,13]}\$"; done