I am searching for a compact/elegant solution to this problem in the Linux shell (ksh if possible).
Given 2 files, both containing lines with a constant structure, e.g.:
file A
354guitar..06
948banjo...05
123ukulele.04
file B
354bass....04
948banjo...04
I would like to loop somehow over file A and search for lines in file B having the same content in positions 4-11 but different content in positions 12-13.
For the case above I would expect the second line of file B as output, since "banjo..." matches the second line of file A and 05!=04.
I was thinking of using awk, but can't find a solution by myself :(
Thanks!
Really simple with awk:
$ awk '{a=substr($0,4,8);b=substr($0,12,2)}NR==FNR{c[a]=b;next}a in c&&c[a]!=b' fileA fileB
948banjo...04
Or, in a more readable format, you can save the following in a script named file.awk:
#!/bin/awk -f
{                        # This is executed for every input line (both files)
    a = substr($0,4,8)   # put characters 4 through 11 into variable a
    b = substr($0,12,2)  # put characters 12 and 13 into variable b
}
NR==FNR {                # This is executed only for the first file
    c[a] = b             # store value b into map c under index a
    next                 # go to the next record (remaining commands ignored)
}
# The rest is only executed for the second file (due to the next command)
(a in c) && (c[a] != b)  # if a is an index of the map c, and the value we
                         # previously stored is not the same as the current b,
                         # then print the current line (the default action)
and execute it like:
awk -f file.awk fileA fileB
You could use a zsh one-liner such as this one (double quotes are needed so the ${line[...]} substring expansions are actually expanded, and [0-9] is used since \d is not supported by POSIX grep):
for line in `cat fileA`; do grep "^[0-9]\{3\}${line[4,11]}" fileB | grep -v "${line[12,13]}$"; done
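For the ksh requirement specifically, here is a minimal plain-ksh sketch (assuming ksh93, where ${var:offset:length} extracts substrings); it rescans fileB for every line of fileA, so it is only suitable for small files:
#!/bin/ksh
# For each line of fileA, print every line of fileB whose characters 4-11
# match but whose characters 12-13 differ.
while IFS= read -r lineA; do
    keyA=${lineA:3:8}     # characters 4-11 (0-based offset 3, length 8)
    valA=${lineA:11:2}    # characters 12-13
    while IFS= read -r lineB; do
        [[ ${lineB:3:8} == "$keyA" && ${lineB:11:2} != "$valA" ]] && print -r -- "$lineB"
    done < fileB
done < fileA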
The below code searches for a set of patterns (contained in the $snplist variable) within multiple files (the $file variable, for files ending in snp_search.txt) and outputs a long list of whether or not each SNP is in each file.
The purpose is to find several SNPs that are in all of the files.
Is there a way to embed the below code in a while loop so that it keeps running until it finds a SNP that is in all of the files, and breaks when it does? Otherwise I have to check the log file manually.
for snp in $snplist; do
    for file in *snp_search.txt; do
        if grep -wq "$snp" $file; then
            echo "${snp} was found in $file" >> ${date}_snp_search.log
        else
            echo "${snp} was NOT found in $file" >> ${date}_snp_search.log
        fi
    done
done
You can use grep to search all the files. If the file names don't contain newlines, you can just count the number of matching files directly:
#! /bin/bash
files=(*snp_search.txt)
count_files=${#files[@]}

for snp in $snplist ; do
    count=$(grep -wl "$snp" *snp_search.txt | wc -l)
    if ((count == count_files)) ; then
        break
    fi
done
For file names containing newlines, you can output the first matching line for each $snp without the file name and count the lines:
count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)
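Slotted into the same loop, that variant would look like this (a sketch; -m1 is a GNU grep option, and the echo message is illustrative):
for snp in $snplist ; do
    count=$(grep -m1 -hw "$snp" *snp_search.txt | wc -l)
    if ((count == count_files)) ; then
        echo "${snp} was found in all files"   # illustrative log message
        break
    fi
done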
Assumptions:
multiple SNPs may exist in a single line of an input file
will print a list of all SNPs that exist in all files (OP has made contradictory statements: find several SNPs that are in all of the files vs. break when one SNP is found in all files)
Sample inputs (will update if OP updates question with sample data):
$ cat snp.dat
ABC
DEF
XYZZ
$ cat 1.snp.search.txt
ABCD-XABC
someABC_stuff
ABC-
de-ABC-
de-ABC
DEFG
zDEFG
.DEF-xyz
abc-DEF
abc-DEF-ABC-xyz
$ cat 2.snp.search.txt
ABC
One GNU awk idea that requires a single pass through each input file:
awk '
FNR==NR { snps[$1]=0; next }   # load 1st file into array; initialize counter (of files containing this snp) to 0
FNR==1  { filecount++          # 1st line of 2nd-nth files: increment counter of the number of files
          delete to_find       # delete our to_find[] array
          for (snp in snps)    # make a copy of our master snps[] array ...
              to_find[snp]     # ... storing the copy in the to_find[] array
        }
        { for (snp in to_find) {            # loop through list of snps
              if ($0 ~ "\\y" snp "\\y") {   # if current line contains a "word" match on the current snp ...
                  snps[snp]++               # increment our snp counter (ie, number of files containing this snp)
                  delete to_find[snp]       # no longer need to search current file for this particular snp
                  # break                   # if a line can only contain 1 snp then uncomment this line
              }
          }
          for (snp in to_find) # if we still have an snp to find then ...
              next             # ... skip to the next line, else ...
          nextfile             # ... skip to the next file
        }
END     { PROCINFO["sorted_in"]="@ind_str_asc"
          for (snp in snps)
              if (snps[snp] == filecount)
                  printf "The SNP %s was found in all files\n", snp
        }
' snp.dat *.snp.search.txt
NOTES:
GNU awk is required for the PROCINFO["sorted_in"]="@ind_str_asc" option, which sorts the snps[] array indices; if GNU awk is not available, or the ordering of the output messages is not important, then this command can be removed from the code
since we only process each input file once, we print all SNPs that show up in all files (ie, we won't know whether a SNP exists in all files until we've processed the last file, so we might as well print all SNPs that exist in all files)
should be faster than processes that require multiple scans of each input file (especially for larger files and/or a large number of SNPs)
This generates:
The SNP ABC was found in all files
I have a file named "compare" and a file named "final_contigs_c10K.fa"
I want to eliminate lines AND THE NEXT LINE from "final_contigs_c10K.fa" containing specific strings listed in "compare".
compare looks like this :
k119_1
k119_3
...
and the number of lines of compare is 26364.
final_contigs_c10K.fa looks like :
>k119_1
AAAACCCCC
>k119_2
CCCCC
>k119_3
AAAAAAAA
...
I want to turn final_contigs_c10K.fa into this format:
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
...
I tried this code, and it seems to work, but it takes too much time. I think that is because compare has 26364 lines, far more than the other files I had tested the code on.
while read line; do sed -i -e "/$line/ { N; d; }" final_contigs_c10K.fa; done < compare
Is there a way to make this command faster?
Using awk
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
This will print the output to stdout, i.e. it won't make any changes to the original files.
Explained:
$ awk '
NR==FNR {             # process the first file
    a[">" $1]         # hash into a, adding the > prefix while at it
    next              # process the next record
}                     # process the second file after this point
$1 in a { p=3 }       # if the current record is in the compare file, set p
--p>0                 # print the matching record and the next one
' compare final_contigs_c10K.fa   # mind the file order
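If you want to update final_contigs_c10K.fa in place, redirect to a temporary file and move it over the original, e.g. (tmp.fa being any scratch name):
awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa > tmp.fa &&
    mv tmp.fa final_contigs_c10K.fa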
I need to filter the output of a command.
I tried this.
bpeek | grep nPDE
My problem is that I need all matches of nPDE and the line after each match. So the output would be like:
iteration nPDE
1 1
iteration nPDE
2 4
Ideally it would show me each found line only once, and then only the line after it.
I found solutions with awk, but as far as I know awk can only read files.
There is an option for that.
grep --help
...
-A, --after-context=NUM print NUM lines of trailing context
Therefore:
bpeek | grep -A 1 'nPDE'
With awk (for completeness since you have grep and sed solutions):
awk '/nPDE/{c=2} c&&c--'
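awk reads standard input when it is given no file operands, so it fits into the pipeline exactly like grep does:
bpeek | awk '/nPDE/{c=2} c&&c--'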
grep -A works if your grep supports it (it's not in POSIX grep). If it doesn't, you can use sed:
bpeek | sed '/nPDE/!d;N'
which does the following:
/nPDE/!d # If the line doesn't match "nPDE", delete it (starts new cycle)
N # Else, append next line and print them both
Notice that this would fail to print the right output for this file
nPDE
nPDE
context line
If you have GNU sed, you can use an address range as follows:
sed '/nPDE/,+1!d'
Addresses of the format addr1,+N define the range between addr1 (in our case /nPDE/) and the following N lines. This solution is easier to adapt to a different number of context lines, but still fails with the example above.
A solution that manages cases like
blah
nPDE
context
blah
blah
nPDE
nPDE
context
nPDE
would look like
sed -n '/nPDE/{$p;:a;N;/\n[^\n]*nPDE[^\n]*$/!{p;b};ba}'
doing the following:
/nPDE/ {                     # If the line matches "nPDE"
    $p                       # If we're on the last line, just print it
    :a                       # Label to jump to
    N                        # Append next line to pattern space
    /\n[^\n]*nPDE[^\n]*$/! { # If appended line does not contain "nPDE"
        p                    # Print pattern space
        b                    # Branch to end (start new loop)
    }
    ba                       # Branch to label (appended line contained "nPDE")
}
All other lines are not printed because of the -n option.
As pointed out in Ed's comment, this is neither readable nor easily extended to a larger number of context lines, but it works correctly for one context line.
I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from the file A.
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this kind of question has been asked before, but the only command I found online gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first, as shown below.
See the comm man page for details.
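With bash process substitution, the unsorted case becomes:
comm -23 <(sort fileA) <(sort fileB)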
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
    remove_lines="$1"
    all_lines="$2"
    tmp_file="$(mktemp)"
    grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
    mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array, as keys, for a later "contains" test.
NR==FNR checks whether we're scanning the first file, i.e. whether the global record counter (NR) equals the current file's record counter (FNR).
a[$0] adds the current line to the associative array as a key; note that this behaves like a set, where there won't be any duplicate keys.
!($0 in a): by now we're in the next file(s); in is a containment test checking whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can do this even if your files are not sorted:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.new && mv file-a.new file-a
(Write to a new file first; redirecting straight onto file-a would truncate it before diff reads it.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N=$N -v lookup="$LOOKUP" '
    BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
    !($N in dictionary) { print }'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
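For instance, to reproduce the fileA/fileB example above with whole-line comparison, the invocation might look like this (N=0 makes $N expand to the whole record inside awk):
N=0
LOOKUP=fileB
awk -v N=$N -v lookup="$LOOKUP" '
    BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
    !($N in dictionary) { print }' fileA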
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())

with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use:
diff fileA fileB | grep "^<" | cut -c3- > fileA.new && mv fileA.new fileA
This will work for files that are not sorted as well. The "<" lines in the diff output are those present only in fileA; writing to fileA.new first avoids truncating fileA before diff reads it.
To add to the Python answer above, here is a faster solution:
python -c '
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}

with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove

with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Raising the power of set subtraction.
To get the file after removing the lines which appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website through lynx and removes the navigation elements using grep! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join command.
grep is only practical when file2 is small, since every line of file1 is checked against every pattern. Use -v along with -f.
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2. (Without -F and -x, the lines of file2 are treated as regexes and may match substrings; see the grep -Fvxf answer above.)
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using -1, -2
or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality
join on the specified files; both inputs must be sorted. Its -v option
also allows removing common lines between two files.
join -v1 -v2 file1 file2
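As with the join answer earlier, unsorted files can be sorted on the fly with bash process substitution:
join -v1 -v2 <(sort file1) <(sort file2)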
Does anyone know how to replace line a with line b and line b with line a in a text file using the sed editor?
I can see how to replace a line in the pattern space with a line that is in the hold space (i.e., /^Paco/x or /^Paco/g), but what if I want to take the line starting with Paco and replace it with the line starting with Vinh, and also take the line starting with Vinh and replace it with the line starting with Paco?
Let's assume for starters that there is one line with Paco and one line with Vinh, and that the line Paco occurs before the line Vinh. Then we can move to the general case.
#!/bin/sed -f
/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
bnotdone
}
After matching /^Paco/ we read into the pattern buffer until s// succeeds (or EOF: the pattern buffer will be printed unchanged). Then we start over searching for /^Paco/.
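Saved as, say, swap.sed (any name will do), the script can be invoked as:
sed -f swap.sed input.txt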
cat input | tr '\n' 'ç' | sed 's/\(ç__firstline__\)\(ç__secondline__\)/\2\1/g' | tr 'ç' '\n' > output
Replace __firstline__ and __secondline__ with your desired regexps. Be sure to substitute any instances of . in your regexp with [^ç]. If your text actually has ç in it, substitute with something else that your text doesn't have.
Try this awk script:
s1="$1"
s2="$2"
awk -v s1="$s1" -v s2="$s2" '
{ a[++d]=$0 }                        # buffer every line
$0~s1 { h=$0; ind=d }                # remember the first target line and its position
$0~s2 {                              # when the second target line arrives ...
    a[ind]=$0                        # ... put it where the first target was
    for(i=1;i<d;i++){ print a[i] }   # print everything before this line (with the swap applied)
    print h                          # print the first target where the second was
    delete a; d=0
}
END{ for(i=1;i<=d;i++){ print a[i] } }' file
output
$ cat file
1
2
3
4
5
$ bash test.sh 2 3
1
3
2
4
5
$ bash test.sh 1 4
4
2
3
1
5
Use sed for simple substitutions only (or not at all); for anything more complicated, use a programming language.
A simple example from the GNU sed texinfo doc:
Note that on implementations other than GNU `sed' this script might
easily overflow internal buffers.
#!/usr/bin/sed -nf
# reverse all lines of input, i.e. the first line becomes the last, ...
# from the second line on, the buffer (which contains all previous lines)
# is *appended* to the current line, so the order is reversed
1! G
# on the last line we're done -- print everything
$ p
# store everything on the buffer again
h
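The same script is commonly condensed into a one-liner; for example:
$ printf '1\n2\n3\n' | sed -n '1!G;$p;h'
3
2
1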