Intersection between two files in Linux

I need to compare two files in Linux; specifically, I need their intersection. For example, the first file, test.txt, contains these lines:
aaaa
bbbb
cccc
dddd
and the second file, test2.txt, contains these lines:
eeee
ffff
aaaa
gggg
dddd
I need the result to be:
aaaa
dddd
I used this command:
comm -23 <(sort -i /var/test.txt) <(sort -i /var/test2.txt) > g.txt
and this is the result
bbbb
cccc
I need the intersection between test.txt and test2.txt. Any help? (grep takes a lot of memory.)

man comm:
EXAMPLES
comm -12 file1 file2
Print only lines present in both file1 and file2.
So:
$ comm -12 <(sort -i test.txt) <(sort -i test2.txt)
aaaa
dddd
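To write the intersection to a file, as in your original command, just redirect the output:
$ comm -12 <(sort test.txt) <(sort test2.txt) > g.txt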

It is unclear whether you are attempting to pick off certain numeric columns (e.g. 2, 3, etc.) or to find common words between two separate files (I take the latter to be your goal; let me know if I'm wrong).
In that case, you cannot suppress any column from either file, because you don't know which column the common words will reside in after sort. One column-agnostic way to find and output the common words in sort order is to simply sort one file and then loop over the sorted words, calling grep -q to test whether each word is present in the second file, outputting it if so (you can control the output format as you desire).
One, not especially pretty, way to accomplish this is:
for i in $(sort -i test1.txt)                   ## loop over sorted test1.txt
do
    grep -q "$i" test2.txt && echo -n "$i "     ## output if value found in test2.txt
done
echo ""
You can wrap it in a subshell (e.g. ( ... )) and just copy and paste it into your terminal (in the directory with the files test1.txt and test2.txt) to see if this meets your needs, e.g.
Example Use/Output
$ (
> for i in $(sort -i test1.txt)
> do
> grep -q "$i" test2.txt && echo -n "$i "
> done
> echo ""
> )
aaaa dddd
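One caveat: grep -q "$i" matches $i as a substring anywhere in a line, so aaa in one file would also match aaaa in the other. If you need exact whole-line matches, a variant like this should be safer:
grep -qxF "$i" test2.txt && echo -n "$i "
Here -x requires the match to span the whole line and -F treats $i as a literal string rather than a regex.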
Look things over and let me know if you have further questions.

Related

How to make a strict match with awk

I am querying one file against the other, and the files are as follows:
File1:
Angela S Darvill| text text text text
Helen Stanley| text text text text
Carol Haigh S|text text text text .....
File2:
Carol Haigh
Helen Stanley
Angela Darvill
This command:
awk 'NR==FNR{_[$1];next} ($1 in _)' File2.txt File1.txt
returns lines that overlap, but does not do a strict match. With a strict match, only Helen Stanley should be returned.
How do you restrict awk to a strict match?
With your shown samples, please try the following. You were on the right track; you need to do two things: first, use the whole line as an index into array a while reading File2.txt, and second, set the field separator to | before awk starts reading File1.txt:
awk -F'|' 'NR==FNR{a[$0];next} $1 in a' File2.txt File1.txt
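With the sample files above, this should print only the strict match (assuming File2.txt has no trailing spaces):
$ awk -F'|' 'NR==FNR{a[$0];next} $1 in a' File2.txt File1.txt
Helen Stanley| text text text text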
The command above doesn't work for me (I am on a Mac; I don't know whether that matters), but
awk 'NR==FNR{_[$0];next} ($1 in _)' File2.txt FS="|" File1.txt
worked well.
You can also use grep, treating File2.txt as a list of patterns, to make an exact match.
You can use sed to prepare the matches. Here is an example:
sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt
^Carol Haigh|
^Helen Stanley|
^Angela Darvill|
...
Then use process substitution to feed that sed output to grep's -f option:
grep -f <(sed -E 's/[ \t]*$//; s/^(.*)$/^\1|/' File2.txt) File1.txt
Helen Stanley| text text text text
Since your example File2.txt has trailing spaces, the sed has s/[ \t]*$//; as the first substitution. If your actual file does not have those trailing spaces, you can do:
grep -f <(sed -E 's/.*/^&|/' File2.txt) File1.txt
Ed Morton brings up a good point that grep will still interpret RE meta-characters in File2.txt. You can use the flag -F so only literal strings are used:
grep -F -f <(sed -E 's/.*/&|/' File2.txt) File1.txt
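With the sample files above (and again assuming no trailing spaces in File2.txt), this should also print only the strict match:
$ grep -F -f <(sed -E 's/.*/&|/' File2.txt) File1.txt
Helen Stanley| text text text text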

grep between two files

I want to find matching lines from file 2 when compared to file 1.
file2 contains multiple columns and column one contains information that could match file1.
I tried the commands below and they didn't give any matching results (the contents of file1 are definitely in file2). I have used these commands previously to compare different files and they worked.
grep -f file1 file2
grep -Fwf file1 file2
When I grep for whatever is not matching, I do get results:
grep -vf file1 file2
file1 contains a list of genes (754 genes), one per line:
ATM
ATP5B
ATR
ATRIP
ATRX
I have a feeling the problem is with my file1. When I typed several items manually into file1 just to test and ran grep against file2, I got the matching lines from file2.
But when I copied the contents of file1 (originally in Excel) into Notepad to make a .txt file, I didn't get any matching results.
I can't see any problem with my file1. Any suggestions?
You said,
I copied the contents of file1 (originally in Excel) into Notepad, making a .txt file
It's likely that the txt file contains carriage-return/linefeed pairs which are screwing up the grep. As I suggested in a comment, try this:
tr -d '\015' < file1 > file1a
grep -Fwf file1a file2
The tr invocation deletes all the carriage returns, giving you a proper Unix/Linux text file with only newlines (\n) as line terminators.
You said:
I can't see any problem with my file1.
Here's how to see the extra-carriage-return problem:
cat -v file1
Those little ^M markers at the end of each line are cat -v's way of showing you the carriage return control codes.
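For example, with the gene list from the question you would see something like:
$ cat -v file1
ATM^M
ATP5B^M
ATR^M
ATRIP^M
ATRX^M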
Addendum:
Carriage Return (CR) is decimal 13, hex 0x0d, octal 015, \r in C.
Line Feed (LF) is decimal 10, hex 0x0a, octal 012, \n in C.
Because it's an old-school utility, tr accepts octal (base 8) notation for control characters.
(I think in some versions tr -d '\r' would work, but I'm not sure, and anyway I'm not sure what version you have. tr -d '\015' should be universal.)
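A quick way to convince yourself that the tr fix works is to fabricate a CRLF-terminated line with printf and inspect it before and after:
$ printf 'ATM\r\n' | cat -v
ATM^M
$ printf 'ATM\r\n' | tr -d '\015' | cat -v
ATM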
A simple shell script that runs grep for every line in file1.txt:
#!/bin/bash
while IFS= read -r content; do
    if grep -q "$content" file2.txt; then
        echo "$content was found in file2" >> results.txt
    fi
done < file1.txt
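Note that grep -q "$content" does a regex substring match, so a short gene name like ATR would also match ATRIP and ATRX. For an exact whole-word match, use:
grep -qFw "$content" file2.txt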
Let's suppose this is file2:
$ cat file2
a b ATM
c d e
f ATR g
Using grep and process substitution
We can get lines from file1 that match any of the columns in file2 via:
$ grep -wFf <(sed 's/[[:space:]]/\n/g' file2) file1
ATM
ATR
This works because it converts file2 to a form that grep understands:
$ sed 's/[[:space:]]/\n/g' file2
a
b
ATM
c
d
e
f
ATR
g
Using awk
$ awk 'FNR==NR{for (i=1;i<=NF;i++) seen[$i]; next} $0 in seen' file2 file1
ATM
ATR
Here, awk keeps track of every column that it sees in file2 and then prints only those lines in file1 that match one of those columns.
You can also try the comm command; it is, roughly, a reverse of diff.

Concatenation of a huge number of selected files from a directory in shell

I have more than 50000 files in a directory, such as file1.txt, file2.txt, ..., file50000.txt. I would like to concatenate some of these files, whose numbers are listed in the following text file (need.txt).
need.txt
1
4
35
45
71
.
.
.
I tried the following. Though it works, I am looking for a simpler and shorter way.
n1=1
n2=$(wc -l < need.txt)
while [ $n1 -le $n2 ]
do
    f1=$(awk -v n="$n1" 'NR==n {print $1}' need.txt)
    cat "file$f1.txt" >> out.txt
    (( n1++ ))
done
This might also work for you:
sed 's/.*/file&.txt/' < need.txt | xargs cat > out.txt
Something like this should work for you:
sed -e 's/.*/file&.txt/' need.txt | xargs cat > out.txt
It uses sed to translate each line into the appropriate file name and then hands the filenames to xargs to hand them to cat.
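For the sample need.txt above, the intermediate sed output is simply the list of file names:
$ sed -e 's/.*/file&.txt/' need.txt
file1.txt
file4.txt
file35.txt
file45.txt
file71.txt
...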
Using awk it could be done this way:
awk 'NR==FNR{ARGV[ARGC]="file"$1".txt"; ARGC++; next} {print}' need.txt > out.txt
Which adds each file to the ARGV array of files to process and then prints every line it sees.
It is possible to do it without any sed or awk command, using only bash built-ins and cat (of course):
for i in $(cat need.txt); do cat "file${i}.txt" >> out.txt; done
And, as you wanted, it is quite simple.
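If the numbers might ever carry stray whitespace, or you prefer not to rely on word splitting, a minimal while-read sketch does the same job:
while IFS= read -r i; do
    cat "file${i}.txt"
done < need.txt > out.txt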

paste linux script

I have a small question and would appreciate your help with it, please.
I need to merge different text files together using the paste command, as in:
paste -d, ~/Desktop/*.txt > ~/Desktop/Out/merge.txt
However, the files get merged out of order (the text files are numbered 1, 2, 3, etc.).
I am using *.txt since a different number of files exists in different scenarios.
Would you mind helping me with it, please?
If you use modern bash you can write:
paste -d, ~/Desktop/{1..10}.txt > ~/Desktop/Out/merge.txt
If not, you must use something like:
paste -d, $(seq 1 10 | sed "s#.*#$HOME/Desktop/&.txt#") > ~/Desktop/Out/merge.txt
If you don't know which files you have in the directory,
you can list and sort them:
cd ~/Desktop/
paste -d, $(ls -1d *.txt| sort -n) > ~/Desktop/Out/merge.txt
Example:
$ touch {1..20}.txt
$ echo $(ls -1 | sort -n)
1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt 10.txt 11.txt 12.txt 13.txt 14.txt 15.txt 16.txt 17.txt 18.txt 19.txt 20.txt
Example2:
$ echo hello > 1.txt
$ echo dear > 5.txt
$ echo friend > 11.txt
$ paste -d, $(ls -1d *.txt| sort -n)
hello,dear,friend
Here's a rather long way of doing the same in one line: it prefixes each full path with its bare file name, sorts numerically on that prefix, then cuts the prefix back off.
paste -d, $(ls ~/Desktop/*.txt | awk -F/ '{print $NF"/"$0}' | sort -n | cut -d/ -f2-) > ~/Desktop/merge.txt
I like one liners :-)
paste -d, $(ls ~/Desktop/*.txt) > ~/Desktop/Out/merge.txt
The * is replaced by an alphabetically sorted list of the filenames in your directory.
3.5.8 Filename Expansion
Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.
So the filenaming does not have to be consecutive ;)

Comparing two files in linux terminal

There are two files called "a.txt" and "b.txt", both containing a list of words. Now I want to check which words in "a.txt" are extra, i.e. not present in "b.txt".
I need an efficient algorithm, as I need to compare two dictionaries.
If you have vim installed, try this:
vimdiff file1 file2
or
vim -d file1 file2
You will find it fantastic.
Sort them and use comm:
comm -23 <(sort a.txt) <(sort b.txt)
comm compares (sorted) input files and by default outputs three columns: lines unique to a, lines unique to b, and lines present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need this.
If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file-name strings in each, I benchmarked (with the built-in time command) this approach against some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25 You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
Try sdiff (man sdiff)
sdiff -s file1 file2
You can use the diff tool in Linux to compare two files, using the --changed-group-format and --unchanged-group-format options to filter the required data.
The following three specifiers can be used to select the relevant group for each option:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.
E.g.: diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
[root@vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight
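If you instead want the lines unique to file2.txt, swap in the %> specifier (a sketch using the same sample files):
[root@vmoracle11 tmp]# diff --changed-group-format='%>' --unchanged-group-format='' file1.txt file2.txt
test nine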
You can also use colordiff, which displays the output of diff in color.
About vimdiff: it also allows you to compare files over SSH, for example:
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html
Also, do not forget about mcdiff, the internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!
Use comm -13 (requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four
You can also use:
sdiff file1 file2
to display differences side by side within your terminal!
diff a.txt b.txt | grep '<'
You can then pipe to cut for clean output:
diff a.txt b.txt | grep '<' | cut -c 3-
Here is my solution for this :
mkdir ~/temp
mkdir ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
Using awk for this. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR {              # process b.txt, the first file
    seen[$0]           # hash its words into the seen array
    next               # move on to the next word in b.txt
}                      # below: process a.txt, i.e. all files after the first
!($0 in seen)' b.txt a.txt    # if a word is not in seen, output it
Duplicates are output:
four
four
To avoid duplicates, add each newly seen word from a.txt to the seen hash:
$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {        # if the word is not hashed to seen
    seen[$0]           # hash unseen a.txt words to avoid duplicates
    print              # and output the word
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (for loops):
awk -F, '                     # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)        # loop over all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
            seen[$i]          # this time we buffer the output (below)
            buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {          # print non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt
Output this time:
four
five,six
