grep is unable to match contents of 1 file in another file - linux

grep is unable to find the contents of one file in another file, and I don't know what is wrong.
I have one file, called mine, with contents like:
sadiadas
HTTP:STC:ACTIVEX:MCAFEE-FREESCN 
HTTP:STC:IMG:ANI-BLOCK-STR2 
HTTP:STC:ADOBE:PDF-LIBTIFF 
HTTP:STC:ADOBE:PS-PNG-BO 
HTTP:STC:DL:EOT-IO 
HTTP:STC:IE:CLIP-MEM 
HTTP:STC:DL:XLS-DATA-INIT 
HTTP:STC:ADOBE:FLASH-RUNTIME 
HTTP:STC:ADOBE:FLASH-ARGREST 
HTTP:STC:DL:MS-NET-CLILOADER-MC 
HTTP:ORACLE:COREL-DRAW-BO 
HTTP:STC:MS-FOREFRONT-RCE 
HTTP:STC:DL:VISIO-UMLSTRING 
HTTP:ORACLE:OUTSIDEIN-CORELDRAW 
HTTP:STC:DL:MAL-M3U 
HTTP:STC:JAVA:MIXERSEQ-OF 
HTTP:STC:DL:MAL-WEBEX-WRF 
HTTP:STC:DL:XLS-FORMULA-BIFF 
HTTP:STC:JAVA:TYPE1-FONT 
HTTP:STC:DL:XLS-FIELD-MC 
HTTP:STC:IE:AUTH-REFLECTION 
HTTP:STC:DL:MOZILLA-WAV-BOF 
HTTP:XSS:PHPNUKE-BOOKMARKS1 
HTTP:STC:DL:MAL-WIN-BRIEFCASE-2 
HTTP:STC:ADOBE:FLASH-INT-OV 
HTTP:STC:IE:MAL-GIF-DOS 
APP:NOVELL:GWMGR-INFODISC 
APP:SYMC:MESSAGING-SAVE.DO-CSRF 
HTTP:STC:ADOBE:READER-MC-RCE 
HTTP:STC:DL:SOPHOS-RAR-VMSF-RGB 
HTTP:ORACLE:OUTSIDE-IN-PRDOX-BO 
HTTP:STC:JAVA:IBM-RMI-PROXY-RCE  
HTTP:STC:IE:REMOVECHILD-UAF 
HTTP:STC:COREL-WP-BOF 
SHELLCODE:MSF:PROPSPRAY 
HTTP:VLC-ABC-FILE-BOF 
HTTP:MISC:MS-XML-SIG-VAL-DOS 
HTTP:STC:ADOBE:FLASH-PLAYER-BOF 
HTTP:STC:ADOBE:FLASHPLR-FILE-MC 
HTTP:STC:ADOBE:FLASH-AS3-INT-OV 
HTTP:ORACLE:OUTSIDE-IN-MSACCESS 
HTTP:STC:SCRIPT:APACHE-XML-DOS 
HTTP:STC:JAVA:METHODHANDLE 
HTTP:STC:ADOBE:CVE-2014-0506-UF 
HTTP:STC:IE:CVE-2014-1789-MC 
HTTP:STC:ACTIVEX:KVIEW-KCHARTXY 
SHELLCODE:X86:LIN-SHELL-REV-80S 
HTTP:STC:JAVA:JRE-PTR-CTRL-EXEC 
HTTP:STC:ADOBE:CVE-2015-0091-CE 
HTTP:DOS:MUL-PRODUCTS 
HTTP:MISC:WAPP-SUSP-FILEUL1 
SHELLCODE:X86:BASE64-NOOP-80C 
SHELLCODE:X86:BASE64-NOOP-80S 
SHELLCODE:X86:REVERS-CONECT-80C 
SHELLCODE:X86:REVERS-CONECT-80S 
SHELLCODE:X86:FLDZ-GET-EIP-80C 
SHELLCODE:X86:FLDZ-GET-EIP-80S 
SHELLCODE:X86:WIN32-ENUM-80C 
SHELLCODE:X86:WIN32-ENUM-80S 
and another file, called 2537_2550, that contains some of the contents of file 1:
HTTP:STC:OUTLOOK:MAILTO-QUOT-CE
HTTP:STC:HSC:HCP-QUOTE-SCRIPT
HTTP:STC:HSC:MS-HSC-URL-VLN
HTTP:STC:TELNET-URL-OPTS
HTTP:STC:NOTES-INI
HTTP:STC:MOZILLA:SHELL
HTTP:STC:RESIZE-DOS
HTTP:STC:IE:SHELL-WEB-FOLDER
HTTP:STC:IE:IE-MHT-REDIRECT
HTTP:IIS:ASP-DOT-NET-BACKSLASH
APP:SECURECRT-CONF
HTTP:STC:IE:IE-FTP-CMD
HTTP:STC:IE:URL-HIDING-ENC
HTTP:STC:MOZILLA:IFRAME-SRC
HTTP:STC:JAVA:MAL-JNLP-FILE
HTTP:STC:MOZILLA:WRAPPED-JAVA
HTTP:STC:MOZILLA:ICONURL-JS
APP:REAL:PLAYER-FORMAT-STRING
HTTP:STC:IE:FULLMEM-RELOAD
HTTP:STC:DL:PPT-SCRIPT
HTTP:STC:MOZILLA:FIREUNICODE
HTTP:STC:IE:MULTI-ACTION
HTTP:STC:IE:CREATETEXTRANGE
HTTP:STC:IE:HTML-TAG-MC
HTTP:STC:IE:NESTED-OBJECT-TAG
SHELLCODE:JS:UNICODE-ENC
HTTP:STC:IE:UTF8-DECODE-OF
HTTP:STC:IE:VML-FILL-BOF
HTTP:STC:MOZILLA:FF-DEL-OBJ-REF
HTTP:STC:ADOBE:ACROBAT-URL-DF
HTTP:STC:CLSID:ACTIVEX:TREND-AX
HTTP:XSS:IE7-XSS
HTTP:STC:NAV-REDIR
HTTP:STC:ACTIVEX:AOL-AMPX
HTTP:STC:ACTIVEX:IENIPP
HTTP:STC:ACTIVEX:REAL-PLAYER
HTTP:STC:ACTIVEX:ORBIT-DWNLDR
HTTP:STC:SEARCH-LINK
HTTP:STC:ITUNES-HANDLER-OF
HTTP:STC:OPERA:FILE-URL-OF
HTTP:STC:ACTIVEX:EASYMAIL
HTTP:STC:ACTIVEX:IETAB-AX
HTTP:STC:ADOBE:PDF-LIBTIFF
HTTP:STC:IE:TOSTATIC-DISC
HTTP:STC:WHSC-RCE
HTTP:STC:IE:CROSS-DOMAIN-INFO
HTTP:STC:IE:UNISCRIBE-FNPS-MC
HTTP:STC:IE:CSS-OF
HTTP:STC:OBJ-FILE-BASE64
HTTP:STC:IE:ANIMATEMOTION
HTTP:STC:CHROME:GURL-XO-BYPASS
HTTP:STC:SAFARI:WEBKIT-1ST-LTR
HTTP:STC:IE:BOUNDELEMENTS
HTTP:STC:IE:IFRAME-MEM-CORR
HTTP:STC:STREAM:QT-HREFTRACK
HTTP:STC:MOZILLA:CONSTRUCTFRAME
HTTP:STC:MOZILLA:ARGMNT-FUNC-CE
HTTP:STC:ADOBE:PS-PNG-BO
HTTP:STC:IE:HTML-RELOAD-CORRUPT
HTTP:STC:IE:TABLE-SPAN-CORRUPT
HTTP:STC:IE:TABLE-LAYOUT
HTTP:STC:DL:MSHTML-DBLFREE
HTTP:STC:IE:EVENT-INVOKE
HTTP:STC:IE:DEREF-OBJ-ACCESS
HTTP:STC:IE:TOSTATIC-XSS
HTTP:STC:ON-BEFORE-UNLOAD
HTTP:STC:DL:MAL-WOFF
HTTP:STC:DL:EOT-IO
HTTP:STC:MOZILLA:FF-REMOTE-MC
HTTP:STC:DL:DIRECTX-SAMI
HTTP:STC:IE:ONREADYSTATE
HTTP:STC:DL:VML-GRADIENT
HTTP:STC:IE:TABLES-MEMCORRUPT
HTTP:STC:JAVA:DOCBASE-BOF
HTTP:STC:IE:CLIP-MEM
HTTP:STC:ACTIVEX:WMI-ADMIN
HTTP:STC:MOZILLA:DOC-WRITE-MC
HTTP:STC:IE:SELECT-ELEMENT
HTTP:STC:IE:XML-ELEMENT-RCE
SHELLCODE:X86:FNSTENV-80C
HTTP:STC:IE:OBJ-MGMT-MC
HTTP:STC:DL:XLS-DATA-INIT
HTTP:STC:ADOBE:FLASH-RUNTIME
HTTP:STC:ACTIVEX:ISSYMBOL
HTTP:STC:ADOBE:FLASH-ARGREST
HTTP:STC:IE:VML-RCE
HTTP:STC:IE:HTML-TIME
HTTP:STC:IE:LAYOUT-GRID
HTTP:STC:IE:CELEMENT-RCE
HTTP:STC:IE:SELECT-EMPTY
HTTP:XSS:MS-IE-TOSTATICHTML
HTTP:STC:SAFARI:WEBKIT-FREE-CE
HTTP:IIS:ASP-PAGE-BOF
HTTP:STC:MOZILLA:FIREFOX-MC
HTTP:STC:MOZILLA:FF-XSL-TRANS
HTTP:STC:DL:MS-NET-CLILOADER-MC
HTTP:STC:MOZILLA:CLEARTEXTRUN
HTTP:STC:MOZILLA:FIREFOX-ENG-MC
HTTP:STC:MOZILLA:PARAM-OF
HTTP:ORACLE:COREL-DRAW-BO
HTTP:STC:MOZILLA:JIT-ESCAPE-MC
HTTP:STC:SAFARI:WEBKIT-SVG-MC
HTTP:STC:SAFARI:INNERHTML-MC
HTTP:STC:MOZILLA:NSCSSVALUE-OF
HTTP:NOVELL:GROUPWISE-IMG-BOF
I tried
grep -Ff mine 2537_2550
but grep didn't find any matches. What is wrong?

Using exactly your input and your command I'm able to find the matching lines:
$ grep -Ff file1 file2
HTTP:STC:ADOBE:PDF-LIBTIFF
HTTP:STC:ADOBE:PS-PNG-BO
HTTP:STC:DL:EOT-IO
HTTP:STC:IE:CLIP-MEM
HTTP:STC:DL:XLS-DATA-INIT
HTTP:STC:ADOBE:FLASH-RUNTIME
HTTP:STC:ADOBE:FLASH-ARGREST
HTTP:STC:DL:MS-NET-CLILOADER-MC
HTTP:ORACLE:COREL-DRAW-BO
You probably have some non-printable character that prevents the matches from being found.
Try removing non-printable characters from both of your files with the following command:
tr -cd '\11\12\15\40-\176' < infile > outfile

I have used the input data you mentioned and it is working.
The following output is given:
$ grep -Ff pattern searchFile
HTTP:STC:ADOBE:PDF-LIBTIFF
HTTP:STC:ADOBE:PS-PNG-BO
HTTP:STC:DL:EOT-IO
HTTP:STC:IE:CLIP-MEM
HTTP:STC:DL:XLS-DATA-INIT
HTTP:STC:ADOBE:FLASH-RUNTIME
HTTP:STC:ADOBE:FLASH-ARGREST
HTTP:STC:DL:MS-NET-CLILOADER-MC
HTTP:ORACLE:COREL-DRAW-BO
There are probably some non-printable characters in your file.
Use cat -vte filename to look for them.
If your file was FTPed from a server running a different OS, such as Windows, use dos2unix filename to convert it to the Unix file format.
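To make the failure mode concrete, here is a small sketch (file names borrowed from the question, contents reduced to a single pattern) showing how one stray carriage return breaks the fixed-string match and how tr repairs it:

```shell
# Simulate a pattern file saved with Windows (CRLF) line endings.
printf 'HTTP:STC:DL:EOT-IO\r\n' > mine
printf 'HTTP:STC:DL:EOT-IO\n' > 2537_2550

grep -Ff mine 2537_2550 || echo "no match"   # the trailing CR spoils the match

tr -d '\r' < mine > mine.unix                # same effect as dos2unix here
grep -Ff mine.unix 2537_2550                 # now prints the matching line
```

cat -vte would show the offending carriage return as ^M just before the $ at each line end.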

Related

How to split file based on first character in Linux shell

I have a fixed-width flat file with header and detail data. Both record types can be recognized by the first character: 1 for header and 2 for detail.
I want to generate two different files from my fixed-width file, each with its own record set, but without the record-type character written.
File Header.txt should contain only type 1 records.
File Detail.txt should contain only type 2 records.
Please let me know how we can achieve this.
Example flatfile:
120190301,025712,FRANK,DURAND,USA
20257120023.12
20257120000.21
20257120191.45
120190301,025737,ERICK,SMITH,USA
20257370000.29
20257370326.41
120190301,025632,JOSEPH,SILVA,USA
20256320019.57
20256320029.12
20256320129.04
Desired Outputs:
Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04
This first one is gawk-specific and works because in gawk "If the value [of FS] is the null string (""), then each character in the record becomes a separate field."
$ awk 'BEGIN {FS=""; f[1]="header.txt"; f[2]="detail.txt"}
{i=$1; sub(/^./,""); print > f[i]}' file
$ cat header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
$ cat detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04
One that should work with any awk:
$ awk '/^1/ {f="header.txt"}
/^2/ {f="detail.txt"}
{sub(/^./,""); print > f}' file
awk '{if(/^1/){ sub(/^./,""); print > "Header.txt" }else{sub(/^./,""); print>"Detail.txt"}}' flatfile
If the first character of a line is 1, strip the first character and write the line to Header.txt; otherwise strip the first character and write the line to Detail.txt.
Outputs:
cat Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
and Detail.txt:
cat Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04
To reduce I/O, you can write both files in a single pass by combining tee with process substitution:
$ tee <All.txt >/dev/null \
>(sed -n '/^1/s/^1//p' >Header.txt) \
>(sed -n '/^2/s/^2//p' >Detail.txt)
$ cat Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
$ cat Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04

Get a particular string from text file

I need to get a particular string from a text file. The content of my file is below:
Components at each of the following levels must be
built before components at higher-numbered levels.
1. SACHHYA-opkg-utils master#964c29cc453ccd3d1b28fb40bae8df11c0dc3b3c
SACHHYA-web-SABARMATI-ap-page master#3bdc2dc1e5cee745cfced370201352045cd57195
SACHHYA-web-update-page master#24b0ffaad4d130ae5a2df0e470868846c7888392
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
googletest-qc8017_32 branches/googletest#2692
LpmMfgTool Release/master/0.0.1-4+34833d6
opensource-avahi-qc8017_32 Release/SACHHYA-master/v1.0-4-gb70507e
opensource-OpenAvnuApple-qc8017_32 Release/SACHHYA-master/v1.0-1766-g1098033
opensource-opkg-qc8017_32 Release/SACHHYA-dev/v0.3.6.2-2-gb1e1aba
opensource-unzip-qc8017_32 Release/master/v6.0.0
opensource-util-linux-qc8017_32 Release/SACHHYA-master/1.5.0-10+877ade5
opensource-zip-qc8017_32 Release/master/v3.0.0
product-startup Release/master/4.0.0-5+5179185
ProductControllerCommon master#a1e71509aaaa9cf7a9e70d4e9c7bfc80d76e13a2
ProductUIAssets master#220944def647a72ce0194d43ef23f1d3fe146987
proprietary-airplay2-qc8017_32 Release/SACHHYA-master/2.0.2-15-g88c1c1d
SABARMATI-HSP-Images Release/master/4.4
SABARMATI-Toolchain Release/master/4.4
SABRMATILPM trunk#3408
SABARMATILpmTools #3604
SABARMATILpmUpdater Release/master/1.0.0-69+a38d6c8
The command that I am trying is:
awk /SACHHYAWebMonaco/ MyFile.txt
Using this command, I am able to get the particular line in which my string is present. Here is the result of the awk command:
SACHHYAWebMonaco Release/MR1_2019/3.0.7-570+36a238d
What I want to extract is only "3.0.7" (the version) from that line.
Does anyone have a suggestion for how to do that?
You can use / and - as field separators and print the third field.
This assumes the format of the lines and position of the information you seek will always be such.
$ awk -F[/-] '/SACHHYAWebMonaco/ {print $3}' file
3.0.7
Perl solution
$ perl -F"[/-]" -lane ' print "$F[2]" if /SACHHYAWebMonaco/ ' sachhya.txt
3.0.7
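If your grep is GNU grep built with PCRE support, the -P flag gives another way to sketch the same extraction; \K discards everything matched before it, so -o prints only the version number (this assumes the version always follows the last slash on the line):

```shell
grep -oP 'SACHHYAWebMonaco.*/\K[0-9]+\.[0-9]+\.[0-9]+' MyFile.txt
# prints: 3.0.7
```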

extract sequences from multifasta file by ID in file using awk

I would like to extract from the multifasta file the sequences that match the IDs given in a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for multiline sequences, but the IDs have to be inserted into the code one at a time.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess the right approach would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may help you with the same:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, eg. DM_0000000004, using the scripts below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we extract the lines from idx1 through idx2 minus 1, which are the title and the sequence. (If every sequence occupied a fixed number of lines, you could use grep -A instead.)
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow: for my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
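For completeness, the RS-based variant the question asked about can also be sketched. Setting RS='>' for the FASTA file only (a per-file assignment on the command line) makes each whole sequence one awk record; the leading empty record produced by the first > is harmless because its first field matches no ID. A hedged sketch, assuming the id.txt/seq.fasta names from the question:

```shell
awk 'NR==FNR { ids[$1]; next }       # first file: remember the wanted IDs
     $1 in ids { printf ">%s", $0 }  # later records: print matching sequences
    ' id.txt RS='>' seq.fasta
```

This prints each matching header line together with all of its sequence lines, however many there are.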

Bash script to list files periodically

I have a huge set of files, 64,000, and I want to create a Bash script that lists the names of the files using
ls -1 > file.txt
for every 4,000 files and stores the resulting file.txt in a separate folder. So every 4,000 files have their names listed in a text file that is stored in a folder. The result is:
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60000-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the output of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), reading from standard input (-), and naming the output files with the prefix folder.
Results in folder00, folder01, ...
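split by itself leaves all the chunks in one directory. If, as in the question, each list must end up as folderNN/file.txt, a small follow-up loop can do the renaming (a sketch; the list prefix and folder names are assumptions):

```shell
# Split the listing into 4000-line chunks named list00, list01, ...
ls -1 | split -l 4000 -d - list

# Move each chunk into its own folderNN directory as file.txt.
for f in list[0-9][0-9]; do
  dir="folder${f#list}"        # list07 -> folder07
  mkdir -p "$dir"
  mv "$f" "$dir/file.txt"
done
```

Since the listNN chunks are created in the same directory being listed, in practice you may prefer to list a different directory, as in ls -1 /path/to/files.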
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> (dir "/file.txt") } '
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.

How to remove the lines which appear on file B from another file A?

I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove from file A all the addresses that appear in file B?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this question has probably been asked before, but the one command I found online gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not a shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first...
See man comm for more details.
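With the sample files from the question (both already sorted), that looks like:

```shell
printf 'A\nB\nC\n' > fileA
printf 'B\nD\nE\n' > fileB
comm -23 fileA fileB   # lines unique to fileA
# prints:
# A
# C
```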
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-place operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
NR==FNR{a[$0];next} idiom is for storing the first file in an associative array as keys for a later "contains" test.
NR==FNR is checking whether we're scanning the first file, where the global line counter (NR) equals the per-file line counter (FNR).
a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)
!($0 in a) we're now in the next file(s), in is a contains test, here it's checking whether current line is in the set we populated in the first step from the first file, ! negates the condition. What is missing here is the action, which by default is {print} and usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME".clean")}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can also do this with diff, even if your files are not sorted:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N=$N -v lookup="$LOOKUP" '
BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
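For instance, a hypothetical variant of the same idea that ignores leading and trailing whitespace on both sides might look like this (lookup.txt and input.txt are placeholder names):

```shell
# Print lines of input.txt whose trimmed value is not in lookup.txt.
awk -v lookup=lookup.txt '
  function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }
  BEGIN { while ((getline line < lookup) > 0) { dict[trim(line)] } }
  !(trim($0) in dict)
' input.txt
```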
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use diff; this also works for files that are not sorted:
diff fileA fileB | grep "^<" | cut -c3- > fileA.new && mv fileA.new fileA
grep "^<" keeps the lines that appear only in fileA. Note that you cannot redirect straight back to fileA in the same pipeline, because the shell would truncate it before diff gets to read it.
Just to add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Raising the power of set subtraction. (Note that because sets are unordered, the original line order is not preserved.)
To get the file after removing the lines that appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website through grep to remove the navigation elements, using lynx! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use the grep, comm, or join commands.
grep is only practical for small files. Use -v along with -f:
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using -1, -2
or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality
join on the specified files. Its -v option also allows removing
common lines between two files:
join -v1 -v2 file1 file2
