How to find the difference between 2 files [duplicate] - linux

This question already has answers here:
Fast way of finding lines in one file that are not in another?
(11 answers)
Closed 4 years ago.
I have 2 txt files.
The first txt file contains something like this: direction:left, move:right
The second txt file contains something like this: direction:right, move:right
Note: on both txt files, everything is on one line.
I want to be able to get the difference between those two txt files. So in the example above, it would return "right".
I tried using grep, comm, and diff. Those didn't work: instead of printing the exact difference, they printed the whole line that differs. I just want the phrase that differs.
How do I do this in bash?

Use grep -F -x -v -f fileB fileA | cut -d':' -f2.
This works by using each line in fileB as a pattern (-f fileB) and treating it as a plain string to match (not a regular regex) (-F). You force the match to happen on the whole line (-x) and print out only the lines that don't match (-v). Therefore you are printing out the lines in fileA that don't contain the same data as any line in fileB.
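Then cut -d':' -f2 splits each remaining line on : and prints the second field. Note that on a line holding several key:value pairs, such as direction:left, move:right, the second field is "left, move", so this works as intended only when each pair sits on its own line.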
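Because the question keeps everything on one line, one way to get the differing phrase is to put each comma-separated pair on its own line first and then apply the same grep and cut. The following is a minimal sketch under that assumption; the helper name split_fields, the temporary directory, and the .fields file names are made up for illustration.

```shell
#!/bin/bash
# Print one "key:value" pair per line, with leading spaces removed.
split_fields() {
    tr ',' '\n' < "$1" | sed 's/^[[:space:]]*//'
}

workdir=$(mktemp -d)

# Sample data taken from the question.
printf 'direction:left, move:right\n'  > "$workdir/fileA"
printf 'direction:right, move:right\n' > "$workdir/fileB"

split_fields "$workdir/fileA" > "$workdir/fileA.fields"
split_fields "$workdir/fileB" > "$workdir/fileB.fields"

# Pairs of fileB that match no pair of fileA, values only.
# prints: right
grep -F -x -v -f "$workdir/fileA.fields" "$workdir/fileB.fields" | cut -d':' -f2
```

Swapping the two .fields arguments prints the value that is only in fileA instead (left).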

Related

Removing content existing in another file in bash [duplicate]

This question already has answers here:
Print lines in one file matching patterns in another file
(5 answers)
Closed 4 years ago.
I am trying to clean file1.txt, whose lines always have the same format, by removing every line that contains an IP address listed in file2.txt.
My script works, but I believe it could be made faster.
My script:
#!/bin/bash
IFS=$'\n'
for i in $(cat file1.txt); do
for j in $(cat file2); do
echo ${i} | grep -v ${j}
done
done
I have tested the script with the following data set:
Number of lines in file1.txt = 10,000
Number of lines in file2.txt = 3
Script execution time:
real 0m31.236s
user 0m0.820s
sys 0m6.816s
file1.txt content:
I3fSgGYBCBKtvxTb9EMz,1.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.2.2.3,45,This IP belongs to office space,1539760502,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.3.2.3,45,This IP belongs to office space,1539760503,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.4.2.3,45,This IP belongs to office space,1539760504,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.5.2.3,45,This IP belongs to office space,1539760505,https://myoffice.com
... lots of other lines in the same format
I3fSgGYBCBKtvxTb9EMz,4.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
file2.txt content:
1.1.2.3
1.2.2.3
... lots of other IPs here
1.2.3.9
How can I improve those timings?
I am confident that the files will grow over time. I will run the script every hour from cron, so I would like to improve the run time.
You want to remove every line in file1.txt that contains a string listed in file2.txt. grep to the rescue:
grep -vFwf file2.txt file1.txt
The -w is needed so that 11.11.11.11 does not match 111.11.11.111.
-F, --fixed-strings, --fixed-regexp Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-w, --word-regexp Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
source: man grep
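To see what -w changes, here is a small sketch with made-up, shortened records in the same comma-separated layout. The temporary directory and the record contents are assumptions for illustration only.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical records: an ID, an IP address, and a number.
printf '%s\n' 'id1,1.1.2.3,45' 'id2,11.1.2.3,45' 'id3,4.1.2.3,45' > "$workdir/file1.txt"
printf '%s\n' '1.1.2.3' > "$workdir/file2.txt"

# With -w only the id1 record is dropped; id2 and id3 are printed.
grep -vFwf "$workdir/file2.txt" "$workdir/file1.txt"

# Without -w the id2 record is dropped too, because 1.1.2.3 is a
# substring of 11.1.2.3; only id3 is printed.
grep -vFf "$workdir/file2.txt" "$workdir/file1.txt"
```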
On a further note, here are a couple of pointers for your script:
Don't use for loops to read files (http://mywiki.wooledge.org/DontReadLinesWithFor).
Don't use cat (See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
Use quotes! (See Bash and Quotes)
This allows us to rewrite it as:
#!/bin/bash
# Read each file line by line instead of word-splitting $(cat ...).
while IFS= read -r i; do
    while IFS= read -r j; do
        echo "$i" | grep -v "$j"
    done < file2.txt
done < file1.txt
The problem now is that file2.txt is read N times, where N is the number of lines in file1.txt, and grep is started once for every pair of lines. This is not efficient. Luckily, grep can do the whole job in a single call (see the command at the top).

finding specific pattern in linux [duplicate]

This question already has answers here:
Print only matching word, not entire line through grep
(2 answers)
Closed 5 years ago.
I want to find a specific pattern in all the files in a directory and copy the matches to another file.
For example,
I want to find LOG_WARNING in one file XYZ and copy the matches to another file.
LOG_WARNING (abc, xyz,("WARNING: Error in sending concurrent_ to pdm\n"));
The command I have used is:
grep -rin "LOG_WARNING.*" file_name.c > output.txt
But it does not copy up to the semicolon. Note that other text follows on the next line. I want to copy everything up to the semicolon (;).
grep -rh "LOG_WARNING" * > out.txt
This will match the pattern in all the files inside the directory.
Since you mentioned that the text after the ';' is on the next line, this command is enough: it matches the pattern and prints the entire line, which ends at the ';'.
Otherwise, try this:
grep -roPh 'LOG_WARNING[^;]*;' * > out.txt
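As a quick check of the -o variant, the sketch below runs it on a single made-up source line that has other statements before and after the call. The file name sample.c and its contents are assumptions; -P is left out because this particular pattern does not need Perl syntax.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical source line; %s keeps the \n inside the string literal as-is.
printf '%s\n' 'init(); LOG_WARNING (abc, xyz,("WARNING: Error in sending\n")); cleanup();' > "$workdir/sample.c"

# -o prints only the matched part, -h suppresses the file name.
grep -oh 'LOG_WARNING[^;]*;' "$workdir/sample.c"
```

Only the text from LOG_WARNING up to the first semicolon is printed. Because grep works line by line, a call that is split across several lines would not be captured this way.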

I have a requirement of searching for a pattern in a file and displaying only the pattern on the screen, not the whole line. How can I do it in Linux? [duplicate]

This question already has answers here:
Can grep show only words that match search pattern?
(15 answers)
Closed 5 years ago.
I have a requirement of searching for a pattern like x=<followed by any values> in a file and displaying only the pattern, i.e. x=<followed by any values>, on the screen, not the whole line. How can I do it in Linux?
I have 3 answers, from simple (but with caveats) to complex (but foolproof):
1) If your pattern never appears more than once per line, you could do this (assuming your shell is
PATTERN="x="
sed "s/.*\($PATTERN\).*/\1/g" your_file | grep "$PATTERN"
2) If your pattern can appear more than once per line, it's a bit harder. One easy but hacky way to do this is to use a special character that will not appear on any line that has your pattern, e.g. "#":
PATTERN="x="
SPECIAL="#"
grep "$PATTERN" your_file | sed "s/$PATTERN/$SPECIAL/g" \
| sed "s/[^$SPECIAL]//g" | sed "s/$SPECIAL/$PATTERN/g"
(This won't separate the matches within a line, e.g. you'll see x=x=x= if a source line contained "x=" three times. This is easy to fix by adding a space in the last sed.)
3) Something that always works no matter what:
PATTERN="x="
awk "NF>1{for(i=1;i<NF;i++) printf FS; print \"\"}" \
FS="$PATTERN" your_file
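The three answers above print the pattern itself. If the value after x= is wanted as well, grep -o (the option the linked duplicate is about) is another route. The following is a minimal sketch, assuming the value runs up to the next whitespace; the sample lines are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical input lines.
printf '%s\n' 'a=1 x=10 y=2' 'no match on this line' 'x=foo and x=bar' > "$workdir/your_file"

# -o prints each match on its own line instead of the whole input line.
grep -o 'x=[^[:space:]]*' "$workdir/your_file"
```

This prints x=10, x=foo and x=bar, one per line. Adjust the bracket expression if values can contain spaces or end at some other delimiter.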

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here to remove unwanted lines which contained strings which matched a particular pattern:
Remove lines containing string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and be initiated by crontab. For this reason, I would like to keep a record of the lines this command removes, so I can go back and check that the correct data is being removed. I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for the word-boundary operator (\<) and appends every deleted line to a file named "deleted". The remaining lines go to standard output, so redirect them to newfile as before. Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
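If GNU awk is not available, the same record can be kept with two grep passes over the input, at the cost of reading the file twice. This is a sketch of that alternative, not the awk method above; it relies on GNU grep for \b, as the original command does, and the sample lines and temporary directory are made up.

```shell
#!/bin/bash
workdir=$(mktemp -d)
pattern='\bvol([[:blank:]]+[[:digit:]]+){2}'

# Hypothetical csv lines: only the first one matches the pattern.
printf '%s\n' 'a,vol 12 345,b' 'c,vol 1,d' 'e,volume 1 2,f' > "$workdir/file"

# Pass 1: append the lines that are about to be removed to the log.
grep -E "$pattern" "$workdir/file" >> "$workdir/deleted"

# Pass 2: write the lines that are kept.
grep -Ev "$pattern" "$workdir/file" > "$workdir/newfile"
```

The log is opened with >> so that successive cron runs add to it instead of overwriting it.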

Using grep to find difference between two big wordlists

I have one 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to remove the most common words from the big list so that I have a new list with only the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, since this does not work.
I have tried the following:
//To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
//But this procedure apparently gives me two empty files.
If I run just the grep without cut first, I get words that I know are in both files.
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
//No luck either
The two text files, in case anyone wants to try for themselves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft line endings while 5k-most-common-sorted.txt has Unix line endings and (b) you are trying to do a whole-line comparison (grep -x). With -x, the trailing carriage return on every line of the larger file keeps it from matching any pattern. So, first we need to convert to a common line ending:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
Now, we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than as regular expressions. This also speeds things up.
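The effect of the line endings can be reproduced on a tiny made-up word list. The sketch below uses tr -d '\r' in place of dos2unix, in case that tool is not installed; the file names and words are assumptions for illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical big list with Microsoft (CRLF) line endings.
printf 'apple\r\nBanana\r\ncherry\r\n' > "$workdir/big.txt"
# Hypothetical common list with Unix (LF) line endings.
printf 'banana\n' > "$workdir/common.txt"

# Before the fix nothing is filtered: every line still ends in a CR,
# so no whole-line match is possible and all three lines are printed.
grep -xivFf "$workdir/common.txt" "$workdir/big.txt"

# Strip the carriage returns, then filter again: Banana is now removed.
tr -d '\r' < "$workdir/big.txt" > "$workdir/big-fixed.txt"
grep -xivFf "$workdir/common.txt" "$workdir/big-fixed.txt"
```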
I note that there are several words in the 5k-most-common-sorted.txt file that are not in the brit-a-z-sorted.txt. For example, "British" is in the common file but not the larger file. Also the common file has "aluminum" while the larger file has only "aluminium".
What do the grep options mean? For those who are curious:
-f means read the patterns from a file.
-F means treat them as fixed strings, not regular expressions.
-i means ignore case.
-x means do whole-line matches.
-v means invert the match. In other words, print those lines that do not match any of the patterns.
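As for the two empty files in the question, a likely cause is the redirection itself: in cut ... 78kfile.txt | tac | tac > 78kfile.txt the shell opens the output file for writing, and truncates it, typically before cut has read anything. A minimal sketch of an in-place trim that avoids this by writing to a temporary file first; the function name trim_first_field is made up for illustration.

```shell
#!/bin/bash
# Keep only the first space-separated field of each line of a file.
# The result is written to a temporary file and then moved over the
# original, so the input is never truncated while it is being read.
trim_first_field() {
    tmp=$(mktemp) || return 1
    cut -d' ' -f1 "$1" > "$tmp" && mv "$tmp" "$1"
}

# Usage with a hypothetical word list.
workdir=$(mktemp -d)
printf '%s\n' 'apple 12' 'banana 7' > "$workdir/words.txt"
trim_first_field "$workdir/words.txt"
cat "$workdir/words.txt"
```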
