How to delete lines in file1 based on column match with file2 - linux

I have 2 files; file1 and file2. File1 has many lines/rows and columns. File2 has just one column, with several lines/rows. All of the strings in file2 are found in file1. I want to create a new file (file3), such that the lines in file1 that contain any of the strings in file2 are deleted.
For example,
File1:
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
File2:
083
121
Desired file3:
Rick has 241 cars
John won 505 dollars
Note that I do not want to enter the strings in file2 into a command manually (the actual files are much larger than in this example).
Thanks!

awk approach:
awk 'BEGIN{p=""}FNR==NR{if(!/^$/){p=p$0"|"} next} $0!~substr(p, 1, length(p)-1)' file2 file1 > file3
p="" the variable treated as pattern containing all column values from file2
FNR==NR ensures that the next expression is performed for the first input file i.e. file2
if(!/^$/){p=p$0"|"} means: if it's not an empty line !/^$/ (as it could be according to your input) concatenate pattern parts with | so it eventually will look like 083|121|
$0!~substr(p, 1, length(p)-1) - checks if a line from the second input file(file1) is not matched with pattern(i.e. file2 column values)
The file3 contents:
Rick has 241 cars
John won 505 dollars
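If the values in file2 might contain regex metacharacters, or you want to avoid building one long pattern, a variant that compares whole fields as literal strings is possible, e.g.:
awk 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++) if($i in a) next} 1' file2 file1 > file3
Here every line of file2 becomes a key of the array a, and a line of file1 is printed (by the trailing 1) only if none of its fields equals one of those keys.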

grep suits your purpose better than a line editor:
grep -v -f File2 File1 >File3
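If the values in File2 are plain strings rather than regular expressions, adding -F avoids regex interpretation, and -w restricts matching to whole words (so 083 cannot match inside 2083):
grep -vwFf File2 File1 > File3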

Try this -
#cat f1
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
#cat f2
083
121
#grep -vwf f2 f1
Rick has 241 cars
John won 505 dollars

Related

How to delete lines in file1 which are not in file2

I have 2 files; file1 and file2. File1 has many lines/rows and columns. File2 has just one column, with several lines/rows. All of the strings in file2 are found in file1. I want to create a new file (file3), such that only the lines in file1 that contain any of the strings in file2 are kept.
For example,
File1:
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
File2:
083
121
Desired file3:
Sally ate 083 popcorn
Bruce knows 121 people
Just use grep -f:
$ cat file1
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
$ cat file2
083
121
$ grep -f file2 file1
Sally ate 083 popcorn
Bruce knows 121 people
To save the output in file3:
grep -f file2 file1 > file3

diff 2 files with an output that does not include extra lines

I have 2 files, test and test1, and I would like to diff them without the extra markers 2a3, 4a6, 6a9 appearing in the output, as shown below.
test:
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
When I diff the two files:
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as
$ diff test test1
blueberries
blackberries
star fruit
A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
As we only need the lines unique to the second file test1, -13 is used to suppress the unwanted columns.
Process Substitution is used to get the sorted files.
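If your shell lacks process substitution, sorting into temporary files works just as well (the .sorted names here are only illustrative):
sort test > test.sorted
sort test1 > test1.sorted
comm -13 test.sorted test1.sorted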
You can use grep to keep only the lines that show a difference:
$ diff test test1 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction indicators that indicate which file differs, use sed:
$ diff test test1 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing to not see which file differs...)
You can use awk
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR is true while the first file on the command line (i.e. test) is being processed,
a[$0] keeps each record in array named a,
next means read next line without doing anything else,
!($0 in a) means if current line does not exist in a, print it.
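Unlike the comm approach, this needs no sorting and preserves the original line order; on the sample files it should print:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
blueberries
blackberries
star fruit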

Print the missing words and the file name - linux

I have two files in the given format:
File 1:
India 215.0
country 165.0
Indian 163.0
s 133.0
Maoist 103.0
Nepal 89.0
group 85.0
Kathmandu 85.0
File 2:
Nepal 89.0
would 88.0
Kathmandu 85.0
rule 82.0
king 80.0
parliament 79.0
card 79.0
I want to print the words that are present in one file but not the other. The file in which each word is found should also be printed next to the word. For example, I want the output to be:
India 215.0, file 1
country 165.0, file 1
group 85.0, file 1
....
....
would 88.0, file 2
I tried using:
grep -v file1 file2
I get the words that are not present in file2, but I want the words that are present in file1 and not file2 and vice-versa, with their respective file names. How can I achieve this? Please help!
# print out all the rows only in file2 and append filename
$ awk 'NR==FNR{a[$1]++;next} !($1 in a){print $0, FILENAME}' file1 file2
would 88.0 file2
rule 82.0 file2
king 80.0 file2
parliament 79.0 file2
card 79.0 file2
# print all the rows only in file1 and append filename
$ awk 'NR==FNR{a[$1]++;next} !($1 in a){print $0, FILENAME}' file2 file1
India 215.0 file1
country 165.0 file1
Indian 163.0 file1
s 133.0 file1
Maoist 103.0 file1
group 85.0 file1
The default field separator is whitespace; $1 is the first column.
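To get both directions and the exact ", file 1" / ", file 2" format from the question in one run, the two passes can be combined in the shell (writing to a file named output here, as a sketch):
{ awk 'NR==FNR{a[$1];next} !($1 in a){print $0", file 1"}' file2 file1
  awk 'NR==FNR{a[$1];next} !($1 in a){print $0", file 2"}' file1 file2; } > output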

grep shows occurrences of pattern on a per line basis

From the input file:
I am Peter
I am Mary
I am Peter Peter Peter
I am Peter Peter
I want the output to look like this:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Where 1, 3 and 2 are occurrences of "Peter".
I tried this, but the info is not formatted the way I wanted:
grep -o -n Peter inputfile
This is not easily solved with grep; I would suggest moving "two tools up" to awk:
awk '$0 ~ FS { print NF-1, $0 }' FS="Peter" inputfile
Output:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Edit:
To answer a question in the comments:
What if I want it case-insensitive? And what if I want multiple patterns,
like "Peter|Mary|Paul", so that "I am Peter peter pAul Mary marY John"
will yield a count of 5?
If you are using GNU awk, you do it by enabling IGNORECASE and setting the pattern in FS like this:
awk '$0 ~ FS { print NF-1, $0 }' IGNORECASE=1 FS="Peter|Mary|Paul" inputfile
Output:
1 I am Peter
1 I am Mary
3 I am Peter Peter Peter
2 I am Peter Peter
5 I am Peter peter pAul Mary marY John
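IGNORECASE is specific to GNU awk. On other awks, a portable sketch is to lowercase each line and count the matches with split (note the pattern must then be lowercase too):
awk '{ n=split(tolower($0), parts, /peter|mary|paul/); if (n>1) print n-1, $0 }' inputfile
split returns the number of pieces, so n-1 is the number of matches on the line.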
You don’t need -o or -n. From grep --help:
-o, --only-matching show only the part of a line matching PATTERN
...
-n, --line-number print line number with output lines
Remove them and your output will be better. I think you’re misinterpreting -n -- it just shows the line number, not the occurrence count.
It looks like you’re trying to get the count of “Peter” appearances per line. You’d need something beyond a single grep for that; awk would be a good choice. Alternatively, you could loop over each line and count the matches within it, printing the count in front of the line.
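A bash sketch of that loop idea, counting per-line matches with grep -o (far slower than awk on large input):
while IFS= read -r line; do
    # grep -o prints each match on its own line; wc -l counts them
    n=$(grep -o 'Peter' <<< "$line" | wc -l)
    (( n > 0 )) && printf '%s %s\n' "$n" "$line"
done < inputfile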

Combine results of column one Then sum column 2 to list total for each entry in column one

I am a bit of a Bash newbie, so please bear with me here.
I have a text file dumped by another piece of software (that I have no control over), listing each user with the number of times they accessed a certain resource. It looks like this:
Jim 109
Bob 94
John 92
Sean 91
Mark 85
Richard 84
Jim 79
Bob 70
John 67
Sean 62
Mark 59
Richard 58
Jim 57
Bob 55
John 49
Sean 48
Mark 46
.
.
.
My goal here is to get output like this:
Jim [Total for Jim]
Bob [Total for Bob]
John [Total for John]
And so on.
Names change each time I run the query in the software, so a static search for each name piped through wc does not help.
This sounds like a job for awk :) Pipe the output of your program to the following awk script:
your_program | awk '{a[$1]+=$2}END{for(name in a)print name " " a[name]}'
Output:
Sean 201
Bob 219
Jim 245
Mark 190
Richard 142
John 208
The awk script itself, explained in expanded form:
# executed on each line
{
    # 'a' is an array; awk initializes it
    # as an empty array on its first use
    # '$1' contains the first column - the name
    # '$2' contains the second column - the amount
    #
    # on every line, the total for 'name'
    # is incremented by 'amount'
    a[$1]+=$2
}
# executed at the end of input
END {
    # print every name and its score
    for (name in a) print name " " a[name]
}
Note, to get the output sorted by score, you can add another pipe to sort -r -k2, which sorts by the second column in reverse order:
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -r -k2
Output:
Jim 245
Bob 219
John 208
Sean 201
Mark 190
Richard 142
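Note that sort -r -k2 compares the second column as text, which happens to work here because every total has three digits; for a robust numeric sort, add -n (restricting the key to field 2 with -k2,2):
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -k2,2nr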
Pure Bash:
declare -A result                  # an associative array mapping name -> running total
while read -r name value; do
    ((result[$name]+=value))       # accumulate the total for each name
done < "$infile"
for name in "${!result[@]}"; do
    printf "%-10s%10d\n" "$name" "${result[$name]}"
done
If the first 'done' has no redirection from an input file, this script can be used with a pipe:
your_program | ./script.sh
and, to sort the output:
your_program | ./script.sh | sort
The output:
Bob 219
Richard 142
Jim 245
Mark 190
John 208
Sean 201
GNU datamash:
datamash -W -s -g1 sum 2 < input.txt
Output:
Bob 219
Jim 245
John 208
Mark 190
Richard 142
Sean 201
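Here -W splits columns on whitespace, -s sorts the input first (datamash requires grouped input to be sorted), and -g1 groups by the first column. To order the result by total instead of by name, a numeric sort can be appended:
datamash -W -s -g1 sum 2 < input.txt | sort -k2,2nr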
