How To Check File Name from 2nd List Name in Linux?

I want to ask how to compare files when I have two lists of names, like
cat /data/file1/ab.txt
aa
bb
cc
dd
ee
cat /data/file2/cd.txt
cc
dd
ee
aa
zz
xx
yy
and I want the output something like :
zz
xx
yy

sort ab.txt > /tmp/file1
sort cd.txt > /tmp/file2
comm -13 /tmp/file1 /tmp/file2
The comm program compares two sorted files and prints three columns: lines unique to the first file, lines unique to the second file, and lines common to both. -13 means to suppress columns 1 and 3, i.e. omit the lines unique to file 1 and the lines the files have in common, leaving only the lines unique to file 2.

You can also use grep:
$ grep -vf ab.txt cd.txt
zz
xx
yy
-f tells grep to obtain patterns from ab.txt and -v inverts the matches.

You can also use awk:
awk 'NR==FNR{a[$1];next}!($1 in a)' ab.txt cd.txt
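One caveat worth knowing about the grep approach: -f treats each line of ab.txt as a regular expression and matches it as a substring, so a pattern like aa would also filter out a line such as aab. Adding -F (fixed strings) and -x (whole-line match) makes the comparison exact. A minimal sketch on the sample data, with the files recreated in the current directory rather than under /data:

```shell
# Recreate the question's sample lists
printf 'aa\nbb\ncc\ndd\nee\n' > ab.txt
printf 'cc\ndd\nee\naa\nzz\nxx\nyy\n' > cd.txt

# -F: fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read patterns from a file
grep -Fxvf ab.txt cd.txt
# prints: zz, xx, yy (one per line)
```

Unlike comm, this needs no sorting, but it reads all of ab.txt into memory, so comm scales better for very large lists.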

Related

4 lines invert grep search in a directory that contains many files

I have many log files in a directory. Those files contain many lines, and some of those lines contain the word ERROR.
I am using grep ERROR abc.* to get the error lines from all of the abc1, abc2, abc3, etc. files.
Now, there are 4-5 ERROR lines that I want to avoid.
So, I am using
grep ERROR abc* | grep -v 'str1\| str2'
This works fine. But when I add one more string,
grep ERROR abc* | grep -v 'str1\| str2\| str3'
the new string has no effect.
I need to avoid 4-5 strings. Can anybody suggest a solution?
You are using multiple search patterns, i.e. effectively a regular expression with alternation. grep's -E flag enables extended regular expressions (ERE), as you can see from the man page below:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
So you need to use the -E flag along with the -v inverted search:
grep ERROR abc* | grep -Ev 'str1|str2|str3|str4|str5'
An example of the usage for your reference:
$ cat sample.txt
ID F1 F2 F3 F4 ID F1 F2 F3 F4
aa aa
bb 1 2 3 4 bb 1 2 3 4
cc 1 2 3 4 cc 1 2 3 4
dd 1 2 3 4 dd 1 2 3 4
xx xx
$ grep -vE "aa|xx|yy|F2|cc|dd" sample.txt
bb 1 2 3 4 bb 1 2 3 4
Your example should work, but you can also use
grep ERROR abc* | grep -e 'str1' -e 'str2' -e 'str3' -v
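A likely culprit in the original command is the space after each \|: the space becomes part of the next pattern, so " str2" only matches "str2" when it is preceded by a space. A small sketch with hypothetical log lines makes this visible:

```shell
printf 'keep me\nstr1 here\nstr2 here\n' > demo.log

# The space after \| is part of the second pattern (" str2"),
# so "str2 here" (no leading space) survives the filter:
grep -v 'str1\| str2' demo.log
# keeps "str2 here" as well as "keep me"

# With -E and no stray spaces, both unwanted lines are filtered:
grep -Ev 'str1|str2' demo.log
# keeps only "keep me"
```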

Printing all the lines that contain a certain word exactly k times

I have to search for all the lines in a file which contain a given word exactly k times. I think I should use grep/sed/awk, but I don't know how. My idea was to check the file line by line using sed and grep like this:
line=1
while [ (sed -n -'($line)p' $name) -n ]; do
if [ (sed -n -'($line)p' $name | grep -w -c $word) -eq "$number" ]; then
sed -n -'($line)p' $name
fi
let line+=1
done
My first problem is that I get the following error: syntax error near unexpected token 'sed'. Then I realized that, for my test file, the command sed -n -'p1' test.txt | grep -w -c "ab" doesn't return the exact number of occurrences of "ab" in the first line of my file (it returns 1, but there are 3 occurrences).
My test.txt file:
abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b
awk to the rescue!
$ awk -F'\\<ab\\>' -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
Note that the \< and \> word boundaries might be gawk-specific.
for variable assignment, I think easiest will be
$ word=ab; awk -F"\\\<$word\\\>" -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
You could use grep, but you'd have to use it twice. (You can't use a single grep because ERE has no way to negate a string, you can only negate a bracket expression, which will match single characters.)
The following is tested with GNU grep v2.5.1, where you can use \< and \> as (possibly non-portable) word delimiters:
$ word="ab"
$ < input.txt egrep "(\<$word\>.*){3}" | egrep -v "(\<$word\>.*){4}"
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
$ < input.txt egrep "(\<$word\>.*){2}" | egrep -v "(\<$word\>.*){3}"
kkmd ab jnabc bad ab
The idea here is that we'll extract from our input file lines with N occurrences of the word, then strip from that result any lines with N+1 occurrences. Lines with fewer than N occurrences of course won't be matched by the first grep.
Or, you might also do this in pure bash, if you're feeling slightly masochistic:
$ word="ab"; num=3
$ readarray lines < input.txt
$ for this in "${lines[@]}"; do declare -A words=(); x=( $this ); for y in "${x[@]}"; do ((words[$y]++)); done; [ "0${words[$word]}" -eq "$num" ] && echo "$this"; done
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
Broken out for easier reading (or scripting):
#!/usr/bin/env bash
# Salt to taste
word="ab"; num=3
# Pull content into an array. This isn't strictly necessary, but I like
# getting my file IO over with quickly if possible.
readarray lines < input.txt
# Walk through the array (or you could just walk through the input file)
for this in "${lines[@]}"; do
# Initialize this line's counter array
declare -A words=()
# Break up the words into array elements
x=( $this )
# Step though the array, counting each unique word
for y in "${x[@]}"; do
((words[$y]++))
done
# Check the count for "our" word
[ "0${words[$word]}" -eq $num ] && echo "$this"
done
Wasn't that fun? :)
But this awk option makes the most sense to me. It's a portable one-liner that doesn't depend on GNU awk (so it'll work in OS X, BSD, etc.)
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
This works by building an associative array to count the words on each line, then printing the line if the count for the "interesting" word is what's specified as num. It's the same basic concept as the bash script above, but awk lets us do this so much better. :)
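A quick run of that one-liner on the question's test file, as a sketch (num=3, so only lines containing "ab" exactly three times as a standalone word should print):

```shell
# Recreate the question's test file
printf 'abc ab cds ab abcd edfs ab\nkkmd ab jnabc bad ab\nabcdefghijklmnop ab cdab ab ab\nabcde bad abc cdef a b\n' > test.txt

# Count every whitespace-separated word per line; print the line when
# the target word's count equals num; reset the array for the next line
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' test.txt
# prints the first and third lines
```

One caveat: `delete a` on a whole array is a widespread extension rather than strict POSIX, though the common awks (gawk, mawk, BSD awk) all accept it.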
You can do this with grep
grep -E "(${word}.*){${number}}" test.txt
This looks for at least ${number} occurrences of ${word} per line; lines with more occurrences also match, so it finds "k or more" rather than exactly k. The wildcard .* is needed since we also want to match occurrences where matches of ${word} are not next to each other.
Here's what I do:
$ echo 'abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b' > test.txt
$ word=abc
$ number=2
$ grep -E "(${word}.*){${number}}" test.txt
abc ab cds ab abcd edfs ab
abcde bad abc cdef a b
Maybe you need to use sed. If you are looking for character sequences, you can use code like this. However, it doesn't distinguish between the word on its own and the word embedded in another word (so it treats ab and abc as both containing ab).
word="ab"
number=2
sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
By default, nothing is printed (-n).
The first -e expression looks for 3 (or more) occurrences of $word and deletes lines containing them (and skips to the next line of input). The $(($number + 1)) is shell arithmetic.
The second -e expression looks for 2 occurrences of $word (there won't be more) and prints the lines that match.
If you want words on their own, then you have to work a lot harder. You'd need extended regular expressions, triggered with the -E option on BSD (Mac OS X), or -r with GNU sed.
number=2
plus1=$(($number + 1))
word=ab
sed -En -e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$plus1}/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$number}$word$/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]|$).*){$number}/p" test.txt
This is similar to the previous version, but it has considerably more delicate word handling.
The unit (^|[^[:alnum:]]) looks for either the start of line or a non-alphanumeric character (change alnum to alpha throughout if you don't want digits to stop matches).
The first -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters, N+1 times, and deletes such lines (skipping to the next line of input).
The second -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times, and then the word again followed by end of line, and deletes such lines.
The third -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times and prints such lines.
Given the (extended) input file:
abc NO ab cds ab abcd edfs ab
kkmd YES ab jnabc bad ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO e bad abc cdef a b
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab NO abcd abcd ab ab
hope NO abcd abcd ab ab ab
nope NO abcd abcd ab ab ab
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
Example output:
kkmd YES ab jnabc bad ab
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
It is not a trivial exercise in sed. It would be simpler if you could rely on word-boundary detection. For example, in Perl:
number=2
plus1=$(($number + 1))
word=ab
perl -n -e "next if /(\b$word\b.*?){$plus1}/;
print if /(\b$word\b.*?){$number}/" test.txt
This produces the same output as the sed script, but is a lot simpler because of the \b word boundary detection (the .*? non-greedy matching isn't crucial to the operation of the script).

How to extract some missing rows by comparing two different files in linux?

I have two different files, and some rows are missing from one of them. I want to make a new file containing the non-common rows between the two files. As an example, I have the following files:
file1:
id1
id22
id3
id4
id43
id100
id433
file2:
id1
id2
id22
id3
id4
id8
id43
id100
id433
id21
I want to extract the rows which exist in file2 but not in file1:
new file:
id2
id8
id21
Any suggestions?
Use the comm utility (assumes bash as the shell):
comm -13 <(sort file1) <(sort file2)
Note how the input must be sorted for this to work, so your delta will be sorted, too.
comm uses an (interleaved) 3-column layout:
column 1: lines only in file1
column 2: lines only in file2
column 3: lines in both files
-13 suppresses columns 1 and 2, which prints only the values exclusive to file2.
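To make the column layout concrete, here is a small sketch with already-sorted files using hypothetical id values:

```shell
printf 'id1\nid3\nid4\n' > file1
printf 'id1\nid2\nid3\nid4\nid8\n' > file2

# All three columns: unique-to-file1, unique-to-file2 (tab-indented once),
# and common lines (tab-indented twice)
comm file1 file2

# Suppress columns 1 and 3, keeping only lines unique to file2:
comm -13 file1 file2
# prints: id2, id8
```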
Caveat: For lines to be recognized as common to both files they must match exactly - seemingly identical lines that differ in terms of whitespace (as is the case in the sample data in the question as of this writing, where file1 lines have a trailing space) will not match.
cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.
For instance, cat -et file1 would output lines such as id1 $, making it obvious that there's a trailing space at the end of the line (represented as $).
If instead of cleaning up file1 you want to compare the files as-is, try:
comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)
A generalized solution that trims leading and trailing whitespace from the lines of both files:
comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
<(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)
Note: The above sed commands require either GNU or BSD sed.
You can also sort both files together, count the duplicate rows, and select only the rows whose count is 1:
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'
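A sanity check of that pipeline on small sample data. Note that it prints lines unique to either file, so it assumes the lines of one file are a subset of the other (as in the question's data); otherwise lines unique to file1 also appear:

```shell
printf 'aaa1\naaa4\nbbb3\nccc2\n' > file1
printf 'bbb3\nccc2\naaa4\n' > file2

# uniq -c prefixes each line with its count; awk keeps count-1 lines
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'
# prints: aaa1

# Append | wc -l if you want the number of such lines instead
```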

How to find uncommon lines between two text files using shell script?

I have two text files file1.txt & file2.txt
file1.txt Contains :
a
b
c
file2.txt Contains :
a
b
c
d
e
f
The Output Should be :
d
e
f
The command I'm trying to use is diff file2.txt file1.txt.
It gives the common lines only.
Assuming that the input files are sorted:
join -v 2 file1.txt file2.txt
Check man join for details on all the other things join can do for you.
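A quick sketch with the question's files (join requires sorted input, which these already are):

```shell
printf 'a\nb\nc\n' > file1.txt
printf 'a\nb\nc\nd\ne\nf\n' > file2.txt

# -v 2 prints the lines of the second file that have no
# matching line in the first file
join -v 2 file1.txt file2.txt
# prints: d, e, f
```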
Please try the ones below:
grep -vf file1.txt file2.txt
comm -13 file1.txt file2.txt
For diff you have to perform something extra:
diff inp inp1 | grep '>' | cut -f2 -d' '

bash remove the same in file

I have an issue with getting the number of different strings.
I have two files, for example :
file1 :
aaa1
aaa4
bbb3
ccc2
and
file2:
bbb3
ccc2
aaa4
How can I get the value 1 from this (in this case because of the string aaa1)?
I have one command, but it counts not only the different strings; it also takes the order of the rows into account:
diff file1 file2 | grep "<" | wc -l
Thanks.
You can use grep -v and -c along with other options, like this:
grep -cvwFf file2 file1
1
Options used are:
-c - get the count of matches
-v - invert matches
-w - full word match (to avoid partial matches)
-F - fixed string match
-f - Use a file for matching patterns
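A sketch reproducing that count on the question's sample data; dropping -c shows which line actually differs:

```shell
printf 'aaa1\naaa4\nbbb3\nccc2\n' > file1
printf 'bbb3\nccc2\naaa4\n' > file2

# Count of lines in file1 that match no whole-word
# fixed string taken from file2:
grep -cvwFf file2 file1
# prints: 1

# Drop -c to print the differing line itself:
grep -vwFf file2 file1
# prints: aaa1
```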
As far as I understand your requirements, sorting the files prior to the diff is a quick solution:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | egrep "[<>]" | wc -l
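The sort-then-diff approach can be sketched end to end on the question's data; after sorting, only genuinely missing lines show up as difference markers:

```shell
printf 'aaa1\naaa4\nbbb3\nccc2\n' > file1
printf 'bbb3\nccc2\naaa4\n' > file2

sort file1 > file1.sorted
sort file2 > file2.sorted

# Difference lines start with < (only in file1) or > (only in file2);
# counting them gives the number of differing strings
diff file1.sorted file2.sorted | egrep "[<>]" | wc -l
# prints: 1
```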
