Extract lines from File2 already found in File1 - Linux

Using the Linux command line, I need to output the lines from text file2 that are already found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!

You are looking for the tool grep.
Check this out.
Let's say you have your inputs in the files file1 and file2:
grep -f file1 file2
will return you
H
D
A
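A caveat worth knowing: grep -f treats each line of file1 as a pattern and matches substrings, so a pattern line A would also match a line like ABBA. For exact whole-line matching, add -F (fixed strings) and -x (whole lines); a minimal sketch with the sample data above:

```shell
# Recreate the sample files from the question.
printf '%s\n' C A G E B D H F > file1
printf '%s\n' N I H J K M D L A > file2

# -F: treat patterns as fixed strings (no regex), -x: match whole lines only.
grep -Fxf file1 file2
# H
# D
# A
```

With single-character lines the output is the same as plain grep -f, but -Fx avoids surprises once the lines get longer or contain regex metacharacters.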

A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $1 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $1 in lines' file1 file2
H
D
A
What does it do?
NR==FNR{lines[$0]++; next}
NR==FNR compares NR, the overall record number, with FNR, the record number within the current file. The two are equal only while the first file, file1, is being read.
lines[$0]++ creates an associative array indexed by the whole line, $0, of file1.
$0 in lines runs only for the second file, because the next in the previous action skips it for the first. It checks whether the line from file2 is present in the saved array lines; if it is, awk's default action of printing the entire line is taken.
awk is more flexible than grep because you can match any column of file1 against any column of file2 and decide which columns to print, rather than always printing the entire line.
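As a sketch of that flexibility, with hypothetical sample files f1 and f2: match column 2 of f1 against column 1 of f2 and print only column 3 of the matching lines.

```shell
# Hypothetical data: the shared ids sit in column 2 of f1 and column 1 of f2.
printf '%s\n' 'x 10' 'y 20' > f1
printf '%s\n' '10 foo keep' '30 bar drop' '20 baz keep' > f2

# Save column 2 of f1, then print column 3 of f2 lines whose column 1 matches.
awk 'NR==FNR{ids[$2]; next} $1 in ids {print $3}' f1 f2
# keep
# keep
```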

This is what the comm utility does, but you have to sort the files first. To get the lines in common between the two files:
comm -12 <(sort File1) <(sort File2)
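On shells without process substitution, the same result can be had with explicit temporary files; a minimal sketch with the sample data from the question:

```shell
# Sample data from the question.
printf '%s\n' C A G E B D H F > File1
printf '%s\n' N I H J K M D L A > File2

# comm requires sorted input, so sort each file first.
sort File1 > File1.sorted
sort File2 > File2.sorted

# -1 and -2 suppress lines unique to each file, leaving only common lines.
comm -12 File1.sorted File2.sorted
# A
# D
# H
```

Note that comm emits the common lines in sorted order, not in either file's original order.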

Related

Match columns 1, 2, and 5 of file1 with columns 1, 2, and 3 of file2 respectively; the output should contain the matched rows from file2. The second file is a zipped .gz file.

file1
3 1234581 A C rs123456
file2 (zipped .gz file):
1 1256781 rs987656 T C
3 1234581 rs123456 A C
22 1792471 rs928376 G T
output
3 1234581 rs123456 A C
I tried
zcat file2.gz | awk 'NR==FNR{a[$1,$2,$5]++;next} a[$1,$2,$3]' file1.txt - > output.txt
but it is not working
Please try the following awk code for your shown samples. Use zcat to read your .gz file and pass it to the awk program as its second input (-), to be read after it is done with file1.
zcat your_file.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
Fixes to the OP's attempt:
You need not increment the array's values while creating it from file1; the mere existence of the indexes is enough.
While reading file2 (passed in by the zcat command), just check whether the respective fields are present in the array; if yes, print that line.
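A self-contained way to try this on the question's samples (gzip -dc is used here as a portable equivalent of zcat):

```shell
# Recreate the question's sample data; gzip file2 to mimic the real input.
printf '3 1234581 A C rs123456\n' > file1
printf '%s\n' '1 1256781 rs987656 T C' \
              '3 1234581 rs123456 A C' \
              '22 1792471 rs928376 G T' | gzip > file2.gz

# Store (col1,col2,col5) of file1; print file2 lines whose (col1,col2,col3) match.
gzip -dc file2.gz | awk 'FNR==NR{arr[$1,$2,$5]; next} (($1,$2,$3) in arr)' file1 -
# 3 1234581 rs123456 A C
```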

How to remove the lines which appear in file1 from another file2, KEEPING empty lines?

I know that in other questions you solve the first part of "How to remove the lines which appear in file1 from another file2" with:
comm -23 file1 file2 and grep -Fvxf file1 file2
But in my case I have empty lines separating data sets that I need to keep, for example:
File 1:
A
B
C
D
E
F
G
H
I
File 2
A
D
F
I
what I want as a result:
B
C
E
G
H
The solution can be in bash or csh.
Thanks
With awk, please try the following:
awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1
Explanation: a detailed explanation of the above code.
awk ' ##Mentioning awk program from here.
FNR==NR{ ##Checking if FNR==NR which will be TRUE when file2 is being read.
arr[$0] ##Creating array indexed by the whole line, $0, of file2.
next ##next will skip all further statements from here.
}
(!NF || !($0 in arr)) ##If line is empty OR not in arr then print it.
' file2 file1 ##Mentioning Input_file names here.
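A minimal reproduction, assuming file1 uses a blank line to separate two groups (a slight variation on the sample above, to exercise the !NF branch that keeps empty lines):

```shell
# file1 has a blank separator line between two groups; file2 lists lines to drop.
printf 'A\nB\nC\n\nD\nE\n' > file1
printf 'A\nD\n' > file2

# Empty lines (!NF) are always printed; other lines only if absent from file2.
awk 'FNR==NR{arr[$0]; next} !NF || !($0 in arr)' file2 file1
# B
# C
#
# E
```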

Linux - Delete lines from file 1 in file 2 BIG DATA

I have two files:
file1:
a
b
c
d
file2:
a
b
f
c
d
e
output file (file2) should be:
f
e
I want the lines of file1 to be deleted directly in file2. The output should not be a new file; it should go directly into file2 (of course a temp file may be created along the way).
The real file2 contains more than 300,000 lines. That is the reason why a solution like:
comm -13 file1 file2
doesn't work on its own.
comm needs the input files to be sorted. You can use process substitution for that:
#!/bin/bash
comm -13 <(sort file1) <(sort file2) > tmp_file
mv tmp_file file2
Output:
e
f
Alternatively, if you have enough memory, you can use the following awk command which does not need the input to be sorted:
awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
Output (preserved sort order):
f
e
Keep in mind that the size of the array a directly depends on the size of file1.
PS: grep -vFf file1 file2 can also be used and the memory requirements are the same as for the awk solution. Given that, I would probably just use grep.
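A sketch of the grep route that also updates file2 in place via a temp file, using the question's data; -x is added here to restrict matches to whole lines, which is safer than the plain substring matching of -vFf:

```shell
# Recreate the question's sample data.
printf '%s\n' a b c d > file1
printf '%s\n' a b f c d e > file2

# Drop every file2 line that exactly matches a file1 line, then
# replace file2 in place via a temp file (original order is preserved).
grep -vxFf file1 file2 > file2.tmp && mv file2.tmp file2

cat file2
# f
# e
```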

Get a line from a file and add as a column into another file

I have two files.
File A:
Unix
File B:
A,B
C,D
E,f
.,.
.,.
N,N
Expected Output:
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix
How is this possible in a Linux shell script?
➜ cat a
A,B
C,D
E,f
.,.
.,.
N,N
➜ cat f2
Unix
➜ awk 'BEGIN{getline f2<"f2"}; {printf("%s,%s\n",$0,f2);}' a
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix
Assuming fileA contains only 1 word, it's better to pass it to awk as a parameter.
awk -v v="Unix" 'BEGIN{FS=OFS=","}{$(NF+1)=v}1' fileB
If fileA contains more words, one per line, you could also use this:
awk 'BEGIN{FS=OFS=","}NR==FNR{a[++i]=$1;next} {for(j=1; j<=i; j++) $(NF+1)=a[j]}1' fileA fileB
And then there is good old paste as an option:
$ cat file1
UNIX
$ cat file2
A,B
C,D
E,F
$ paste -d',' file2 <(yes `cat file1` | head -n $(cat file2 | wc -l))
A,B,UNIX
C,D,UNIX
E,F,UNIX
The tricky part here is that the number of rows differs between file1 and file2, so we need to repeat the UNIX row of file1 as many times as there are rows in file2 to be able to use paste.
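When fileA really holds a single word, a plain sed substitution is another option; note that this naive form assumes the word contains no sed metacharacters such as / or &:

```shell
# Hypothetical file names fA and fB, mirroring the example above.
printf 'UNIX\n' > fA
printf '%s\n' 'A,B' 'C,D' 'E,F' > fB

# Append ",<word>" to the end ($) of every line of fB.
word=$(cat fA)
sed "s/\$/,$word/" fB
# A,B,UNIX
# C,D,UNIX
# E,F,UNIX
```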

Print last field in file and use it for name of another file

I have a tab delimited file with 3 rows and 7 columns. I want to use the number at the end of the file to rename another file.
Example of tab delimited file:
a b c d e f g
a b c d e f g
a b c d e f 1235
So, I want to extract the number from tab delimited file and then rename "file1" to the number extracted (mv file1 1235)
I can print the column, but I cannot seem to extract just the number from the file. Even if I can extract the number I can't seem to figure out how to store that number to use as the new file name.
You can use this awk
name=$(awk 'END {print $NF}' file)
mv file $name
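A self-contained sketch of that answer, with the variable quoted when renaming (the file names table and file1 here are hypothetical):

```shell
# Sample tab-delimited data: 3 rows, 7 columns, number in the last field.
printf 'a\tb\tc\td\te\tf\tg\n'    >  table
printf 'a\tb\tc\td\te\tf\tg\n'    >> table
printf 'a\tb\tc\td\te\tf\t1235\n' >> table
: > file1                         # the file to be renamed

# In the END block, $NF still holds the last field of the last record read.
num=$(awk 'END{print $NF}' table)
mv file1 "$num"                   # quoting guards against empty/odd values

ls "$num"
# 1235
```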
something along these lines perhaps?
num=$(tail -1 file1 | rev | awk '{print $1}' | rev)
mv file1 $num
Using a perl one-liner
perl -ne 'BEGIN{($f) = @ARGV} ($n) = /(\d+)$/; END{rename($f, $n)}' file1
