How to delete lines in file1 that are not in file2 - linux

I have 2 files; file1 and file2. File1 has many lines/rows and columns. File2 has just one column, with several lines/rows. All of the strings in file2 are found in file1. I want to create a new file (file3) containing only the lines in file1 that match any of the strings in file2.
For example,
File1:
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
File2:
083
121
Desired file3:
Sally ate 083 popcorn
Bruce knows 121 people

Just use grep -f:
$ cat file1
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
$ cat file2
083
121
$ grep -f file2 file1
Sally ate 083 popcorn
Bruce knows 121 people
To save the output in file3:
grep -f file2 file1 > file3
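One caveat: grep -f treats each line of file2 as a regular expression, and a bare 083 would also match inside a longer token like 2083. If the entries are literal strings that should only match whole words, combining -F (fixed strings) with -w (whole words) is safer:
grep -Fwf file2 file1 > file3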

How to delete lines in file1 based on column match with file2

I have 2 files; file1 and file2. File1 has many lines/rows and columns. File2 has just one column, with several lines/rows. All of the strings in file2 are found in file1. I want to create a new file (file3), such that the lines in file1 that contain any of the strings in file2 are deleted.
For example,
File1:
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
File2:
083
121
Desired file3:
Rick has 241 cars
John won 505 dollars
Note that I do not want to enter the strings in file 2 into a command manually (the actual files are much larger than in the example).
Thanks!
awk approach:
awk 'BEGIN{p=""}FNR==NR{if(!/^$/){p=p$0"|"} next} $0!~substr(p, 1, length(p)-1)' file2 file1 > file3
p="" the variable treated as pattern containing all column values from file2
FNR==NR ensures that the next expression is performed for the first input file i.e. file2
if(!/^$/){p=p$0"|"} means: if it's not an empty line !/^$/ (as it could be according to your input) concatenate pattern parts with | so it eventually will look like 083|121|
$0!~substr(p, 1, length(p)-1) - checks if a line from the second input file(file1) is not matched with pattern(i.e. file2 column values)
The file3 contents:
Rick has 241 cars
John won 505 dollars
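As a variant of the same idea, here is a sketch that skips building one big regex and instead checks each file2 value as a plain substring with index() (this assumes the values in file2 are literal strings, not regular expressions):
awk 'FNR==NR{if(!/^$/)vals[$0]; next} {for(v in vals) if(index($0,v)) next; print}' file2 file1 > file3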
grep suits your purpose better than a line editor:
grep -v -f File2 File1 >File3
Try this -
#cat f1
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
#cat f2
083
121
#grep -vwf f2 f1
Rick has 241 cars
John won 505 dollars
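Since the real files are large: without -F, every line of f2 is treated as a regular expression, which can get slow with many patterns. Treating them as fixed strings is usually noticeably faster:
grep -vwFf f2 f1 > file3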

Print the missing words and the file name - linux

I have two files in the given format:
File 1:
India 215.0
country 165.0
Indian 163.0
s 133.0
Maoist 103.0
Nepal 89.0
group 85.0
Kathmandu 85.0
File 2:
Nepal 89.0
would 88.0
Kathmandu 85.0
rule 82.0
king 80.0
parliament 79.0
card 79.0
I want to print the words that are present in one file but not the other. The file in which each word is found should also be printed next to the word. For example, I want the output to be:
India 215.0, file 1
country 165.0, file 1
group 85.0, file 1
....
....
would 88.0, file 2
I tried using:
grep -v file1 file2
I get the words that are not present in file2, but I want the words that are present in file1 and not file2 and vice-versa, with their respective file names. How can I achieve this? Please help!
# print out all the rows only in file2 and append filename
$ awk 'NR==FNR{a[$1]++;next} !($1 in a){print $0, FILENAME}' file1 file2
would 88.0 file2
rule 82.0 file2
king 80.0 file2
parliament 79.0 file2
card 79.0 file2
# print all the rows only in file1 and append filename
$ awk 'NR==FNR{a[$1]++;next} !($1 in a){print $0, FILENAME}' file2 file1
India 215.0 file1
country 165.0 file1
Indian 163.0 file1
s 133.0 file1
Maoist 103.0 file1
group 85.0 file1
The default field separator is whitespace; $1 is the first column.
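If you want something close to the ", file 1" formatting from the question in one output, a small sketch is to run the two passes back to back and print FILENAME after a comma (note that FILENAME prints the name exactly as given on the command line, so it reads file1 rather than file 1):
awk 'NR==FNR{a[$1];next} !($1 in a){print $0 ", " FILENAME}' file2 file1
awk 'NR==FNR{a[$1];next} !($1 in a){print $0 ", " FILENAME}' file1 file2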

grep shows occurrences of pattern on a per line basis

From the input file:
I am Peter
I am Mary
I am Peter Peter Peter
I am Peter Peter
I want output to be like this:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Where 1, 3 and 2 are occurrences of "Peter".
I tried this, but the info is not formatted the way I wanted:
grep -o -n Peter inputfile
This is not easily solved with grep; I would suggest moving "two tools up" to awk:
awk '$0 ~ FS { print NF-1, $0 }' FS="Peter" inputfile
Output:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Edit:
To answer a question in the comments:
What if I want case-insensitive matching? And what if I want multiple
patterns like "Peter|Mary|Paul", so that "I am Peter peter pAul Mary marY John"
will yield a count of 5?
If you are using GNU awk, you do it by enabling IGNORECASE and setting the pattern in FS like this:
awk '$0 ~ FS { print NF-1, $0 }' IGNORECASE=1 FS="Peter|Mary|Paul" inputfile
Output:
1 I am Peter
1 I am Mary
3 I am Peter Peter Peter
2 I am Peter Peter
5 I am Peter peter pAul Mary marY John
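If you are not on GNU awk (IGNORECASE is a gawk extension), a portable sketch of the same idea lowercases a copy of each line and counts matches with gsub(), which returns the number of substitutions it performed:
awk '{s=tolower($0); n=gsub(/peter|mary|paul/, "", s); if(n) print n, $0}' inputfile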
You don’t need -o or -n. From grep --help:
-o, --only-matching show only the part of a line matching PATTERN
...
-n, --line-number print line number with output lines
Remove them and your output will be better. I think you’re misinterpreting -n -- it just shows the line number, not the occurrence count.
It looks like you’re trying to get the count of “Peter” appearances per line. You’d need something beyond a single grep for that. awk could be a good choice. Or you could loop over each line, count the matches in it, and print the count alongside the line, as sketched below.
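For completeness, here is a rough sketch of that loop idea in plain shell, counting occurrences per line with grep -o and wc -l (assuming a fixed pattern):
while IFS= read -r line; do
    n=$(printf '%s\n' "$line" | grep -o 'Peter' | wc -l)
    [ "$n" -gt 0 ] && printf '%s %s\n' "$n" "$line"
done < inputfile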

How to output a list of unique entries in a tab-delimited file

I have a file
1 ABC 123 345 Apples
1 ABC 345 345 Apples
1 ABC 123 345 Apples_Fuji
1 ABC 123 345 ApplesApplesApples
1 ABC 123 345 Pears
1 ABC 123 345 Banana
...
I wish to get an output file
Apples 2
Apples_Fuji 1
ApplesApplesApples 1
Pears 1
Banana 1
...
I'm not sure whether grepping them one at a time would work (-o would be inaccurate anyway, and -c strangely gives me a value of 1 every time).
Solution with cut, sort and uniq:
cut -f5 test | sort | uniq -c
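Note that uniq -c puts the count first (2 Apples). If you want the Apples 2 order shown in the question, one way is to swap the columns afterwards with awk (this assumes the values in column 5 contain no whitespace, which holds here):
cut -f5 test | sort | uniq -c | awk '{print $2, $1}'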
Try awk:
$ awk '{arr[$NF]++}END{for(i in arr) print i,arr[i]}' file
ApplesApplesApples 1
Apples 2
Banana 1
Apples_Fuji 1
Pears 1
Here's another way, using grep and uniq ([[:space:]] in the character class covers tabs as well as spaces, since the file is tab-delimited):
$ grep -oE '[^[:space:]]+$' file | sort | uniq -c
2 Apples
1 Apples_Fuji
1 ApplesApplesApples
1 Pears
1 Banana
One solution using awk/sort/uniq could be:
awk '{print $5}' file | sort | uniq -c
In Perl:
perl -lane '$h{$F[4]}++ unless(/^\s*$/);END{print "$_ $h{$_}" for(keys %h)}' your_file
Tested Below:
> cat temp
1 ABC 123 345 Apples
1 ABC 345 345 Apples
1 ABC 123 345 Apples_Fuji
1 ABC 123 345 ApplesApplesApples
1 ABC 123 345 Pears
1 ABC 123 345 Banana
> perl -lane '$h{$F[4]}++ unless(/^\s*$/);END{print "$_ $h{$_}" for(keys %h)}' temp
Pears 1
ApplesApplesApples 1
Banana 1
Apples 2
Apples_Fuji 1
>

Combine results of column one, then sum column 2 to list the total for each entry in column one

I am a bit of a Bash newbie, so please bear with me here.
I have a text file dumped by another software (that I have no control over) listing each user with number of times accessing certain resource that looks like this:
Jim 109
Bob 94
John 92
Sean 91
Mark 85
Richard 84
Jim 79
Bob 70
John 67
Sean 62
Mark 59
Richard 58
Jim 57
Bob 55
John 49
Sean 48
Mark 46
.
.
.
My goal here is to get an output like this.
Jim [Total for Jim]
Bob [Total for Bob]
John [Total for John]
And so on.
Names change each time I run the query in the software, so a static search on each name piped through wc does not help.
This sounds like a job for awk :) Pipe the output of your program to the following awk script:
your_program | awk '{a[$1]+=$2}END{for(name in a)print name " " a[name]}'
Output:
Sean 201
Bob 219
Jim 245
Mark 190
Richard 142
John 208
The awk script itself can be explained better in this format:
# executed on each line
{
    # 'a' is an array; it is initialized
    # as an empty array by awk on its first use
    # '$1' contains the first column - the name
    # '$2' contains the second column - the amount
    #
    # on every line the total score of 'name'
    # will be incremented by 'amount'
    a[$1]+=$2
}
# executed at the end of input
END{
    # print every name and its score
    for(name in a)print name " " a[name]
}
Note: to get the output sorted by score, add another pipe to sort. -k2,2nr sorts numerically by the second column in descending order (a plain -r -k2 sorts lexically, which only happens to work here because all the totals have the same number of digits):
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -k2,2nr
Output:
Jim 245
Bob 219
John 208
Sean 201
Mark 190
Richard 142
Pure Bash:
declare -A result   # an associative array
while read name value; do
    ((result[$name]+=value))
done < "$infile"
for name in ${!result[*]}; do
    printf "%-10s%10d\n" $name ${result[$name]}
done
If the first 'done' has no redirection from an input file, this script can be used with a pipe:
your_program | ./script.sh
and to sort the output:
your_program | ./script.sh | sort
The output:
Bob 219
Richard 142
Jim 245
Mark 190
John 208
Sean 201
GNU datamash (-W: treat runs of whitespace as field delimiters, -s: sort the input for grouping, -g1: group by column 1, sum 2: sum column 2):
datamash -W -s -g1 sum 2 < input.txt
Output:
Bob 219
Jim 245
John 208
Mark 190
Richard 142
Sean 201
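If you would rather have the result ordered by total instead of by name, a small sketch is to keep -s (datamash needs grouped input) and pipe through a numeric, descending sort on column 2:
datamash -W -s -g1 sum 2 < input.txt | sort -k2,2nr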
