match columns 1,2,5 of file1 with columns 1,2,3 of file2 respectively; output should have the matched rows from file2. The second file is a zipped .gz file - linux

file1
3 1234581 A C rs123456
file2 (zipped .gz file)
1 1256781 rs987656 T C
3 1234581 rs123456 A C
22 1792471 rs928376 G T
output
3 1234581 rs123456 A C
I tried
zcat file2.gz | awk 'NR==FNR{a[$1,$2,$5]++;next} a[$1,$2,$3]' file1.txt - > output.txt
but it is not working.

Please try the following awk code for your shown samples. Use zcat to read your .gz file and pass it as the 2nd input to the awk program, to be read after awk is done reading file1.
zcat your_file.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
Fixes in OP's attempt:
You need not increment the array's values while creating it from file1; the mere existence of the indexes is enough.
While reading file2 (passed in by the zcat command), just check whether the respective fields are present in the array; if yes, print that line. A commented version of the program follows below.
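For readability, here is the same program spread over multiple lines with comments (the filenames are the ones used in the answer above):
zcat your_file.gz | awk '
FNR==NR {           # true only while the 1st input (file1) is being read
  arr[$1,$2,$5]     # store columns 1, 2 and 5 as an array index
  next              # skip the main condition below for file1 lines
}
($1,$2,$3) in arr   # for file2 lines: print when columns 1, 2, 3 were seen in file1
' file1 -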


combine two csv files based on common column using awk or sed

I have two CSV files which share a common column, along with duplicates in one file. How do I merge both CSV files using awk or sed?
CSV file 1
5/1/20,user,mark,Type1 445566
5/2/20,user,ally,Type1 445577
5/1/20,user,joe,Type1 445588
5/2/20,user,chris,Type1 445566
CSV file 2
Type1 445566,Name XYZ11
Type1 445577,Name AAA22
Type1 445588,Name BBB33
Type1 445566,Name XYZ11
What I want is:
5/1/20,user,mark,Type1 445566,Name XYZ11
5/2/20,user,ally,Type1 445577,Name AAA22
5/1/20,user,joe,Type1 445588,Name BBB33
5/2/20,user,chris,Type1 445566,Name XYZ11
So is there a bash command in Linux/Unix to achieve this? Can we do this using awk or sed?
Basically, I need to match column 4 of CSV file 1 with column 1 of CSV file 2 and merge both CSVs.
I tried the following command:
paste -d, <(cut -d, -f 1-2 ./test1.csv | sed 's/$/,Type1/') test2.csv
Result:
5/1/20,user,Type1,Type1 445566,Name XYZ11
If you are able to install the join utility, this command works:
join -t, -o 1.1,1.2,1.3,2.1,2.2 -1 4 -2 1 file1.csv file2.csv
Explanation:
-t, identify the field separator as comma (',')
-o 1.1,1.2,1.3,2.1,2.2 format the output to be "file1col1, file1col2, file1col3, file2col1, file2col2"
-1 4 join by column 4 in file1
-2 1 join by column 1 in file2
For additional usage information for join, reference the join manpage.
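Note that join requires both inputs to be sorted on the join fields, and the sample files are not. A sketch using process substitution (my addition, not part of the original answer; -u drops file2's duplicated line so each key matches once, and the output order follows the sort rather than the original file):
join -t, -1 4 -2 1 -o 1.1,1.2,1.3,2.1,2.2 \
    <(sort -t, -k4,4 file1.csv) <(sort -t, -k1,1 -u file2.csv)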
Edit: You specifically asked for the solution using awk or sed, so here is the awk implementation:
awk -F"," 'NR==FNR {a[$1] = $2; next} {print $1","$2","$3","$4"," a[$4]}' \
file2.csv \
file1.csv
Explanation:
-F"," Delimit by the comma character
NR==FNR Read the first file argument (notice in the above solution that we're passing file2 first)
{a[$1] = $2; next} In the current file, save the contents of Column2 in an array that uses Column1 as the key
{print $1","$2","$3","$4"," a[$4]} Read file1 and using Column4, match the value to the key's value from the array. Print Column1, Column2, Column3, Column4, and the key's value.
The two example input files seem to be already appropriately sorted, so you just have to put them side by side, and paste is good for this; however you want to remove some ,-separated columns from file1, and you can use cut for that; but you also want to insert another (constant) column, and sed can do it. A possible command is this:
paste -d, <(cut -d, -f 1-2 file1 | sed 's/$/,abcd/') file2
Actually sed can do the whole processing of file1, and the output can be piped into paste, which uses - to capture it from the standard input:
sed -E 's/^(([^,]+,){2}).*/\1abcd/' file1 | paste -d, - file2

Linux - Delete lines from file 1 in file 2 BIG DATA

I have two files:
file1:
a
b
c
d
file2:
a
b
f
c
d
e
output file (file2) should be:
f
e
I want the lines of file1 to be deleted directly in file2. The output should not be a new file; it should go directly into file2. Of course a temp file can be created.
The real file2 contains more than 300,000 lines. That is the reason why a solution like:
comm -13 file1 file2
doesn't work.
comm needs the input files to be sorted. You can use process substitution for that:
#!/bin/bash
comm -13 <(sort file1) <(sort file2) > tmp_file
mv tmp_file file2
Output:
e
f
Alternatively, if you have enough memory, you can use the following awk command which does not need the input to be sorted:
awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
Output (input order preserved):
f
e
Keep in mind that the size of the array a directly depends on the size of file1.
PS: grep -vFf file1 file2 can also be used and the memory requirements are the same as for the awk solution. Given that, I would probably just use grep.
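To update file2 in place, as the question asks, a temp-file sketch (the -x flag is my addition so that only whole-line matches are deleted, not substrings):
grep -vxFf file1 file2 > tmp_file && mv tmp_file file2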

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File1 columns:
Name Age
abc 12
bcd 14
File2 columns:
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and then check if they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.
First make a temporary file from file1 that should be the same as file2.
The Name field might contain spaces, so remove everything up to the last space.
When you have done this you can compare the files.
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
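If you really want a single loop as asked, both files can be read in lockstep on separate file descriptors. A bash sketch (tail -n +2 skips the header lines, which I assume should not be compared):
#!/bin/bash
while read -r name age <&3 && read -r age2 <&4; do
    if [ "$age" = "$age2" ]; then
        echo "match: $age"
    else
        echo "differ: $age vs $age2"
    fi
done 3< <(tail -n +2 file1) 4< <(tail -n +2 file2)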

Linux split a file in two columns

I have the following file that contains 2 columns:
A:B:IP:80 apples
C:D:IP2:82 oranges
E:F:IP3:84 grapes
How is it possible to split the file into 2 other files, one column per file, like this:
File1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
File2
apples
oranges
grapes
Try:
awk '{print $1>"file1"; print $2>"file2"}' file
After running that command, we can verify that the desired files have been created:
$ cat file1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
And:
$ cat file2
apples
oranges
grapes
How it works
print $1>"file1"
This tells awk to write the first column to file1.
print $2>"file2"
This tells awk to write the second column to file2.
Perl 1-liner using (abusing) the fact that print goes to STDOUT, i.e. file descriptor 1, and warn goes to STDERR, i.e. file descriptor 2:
# perl -n means loop over the lines of input automatically
# perl -e means execute the following code
# chomp means remove the trailing newline from the expression
perl -ne 'chomp(my @cols = split /\s+/); # Split each line on whitespace
print $cols[0] . "\n";
warn $cols[1] . "\n"' <input 1>col1 2>col2
You could, of course, just use cut with the appropriate delimiter and fields, but then you would need to read the file twice.
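For completeness, the two-pass cut approach might look like this (assuming the two columns are separated by a single space):
cut -d' ' -f1 file > File1
cut -d' ' -f2 file > File2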
Here's an awk solution that'll work with any number of columns:
awk '{for(n=1;n<=NF;n++)print $n>"File"n}' input.txt
This steps through each field on the line and prints the field to a different output file based on the column number.
Note that blank fields -- or rather, lines with fewer fields than other lines -- will cause line numbers to mismatch. That is, if your input is:
A 1
B
C 3
Then File2 will contain:
1
3
If this is a concern, mention it in an update to your question.
You could of course do this in bash alone, in a number of ways. Here's one:
while read -r line; do
    a=($line)
    for m in "${!a[@]}"; do
        printf '%s\n' "${a[$m]}" >> "File$((m+1))"
    done
done < input.txt
This reads each line of input into $line, then word-splits $line into values in the $a[] array. It then steps through that array, printing each item to the appropriate file, named for the index of the array (plus one, since bash arrays start at zero).
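A small robustness tweak (my variation, not part of the original answer): letting read split the line into the array directly avoids the glob expansion that an unquoted $line can trigger:
while read -ra a; do
    for m in "${!a[@]}"; do
        printf '%s\n' "${a[$m]}" >> "File$((m+1))"
    done
done < input.txt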

Bash: grep selected text from a file

I have two files, file1:
abc/def/ghi/ss/sfrere/sfs
xyz/pqr/sef/ert/wwqwq/bh
file2:
ind abc def
bcf pqr sss
I wish to grep lines from file1 such that 2 or more words from any line of file2 match on a single line of file1. So in this case the answer would be the first line, as abc and def are both present in the first line of file1.
This should do the trick,
awk 'FNR==NR{a[$1];next}{for(i in a){c=0;for(j=1;j<=NF;j++){if(index(i,$j)>0)c++}if(c>=2)print i}}' file1.txt file2.txt
Explanation
FNR==NR{a[$1];next} iterates through File1.txt first and stores its lines as indexes in a.
for(i in a) loops through the stored lines.
c=0 resets a counter that keeps track of the number of matched columns.
for(j=1;j<=NF;j++) loops through the columns in each line of File2.txt.
if(index(i,$j)>0)c++ increments the counter if the column from File2.txt occurs in the stored File1.txt line.
if(c>=2)print i applies your given condition that at least 2 columns should match, in which case we print the line from File1.txt.
This is the most straightforward way that I could think of; I'm sure there are crazier ways to do this.
On a huge file:
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' file2 >/tmp/file2.egrep
egrep -f /tmp/file2.egrep file1
rm /tmp/file2.egrep
This creates a temporary pattern file for egrep based on file2's content; for the sample file2, the expansion looks like the example below.
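For the sample file2, the first line "ind abc def" expands to this single egrep pattern, covering every pair of words in both orders:
(ind.*abc)|(abc.*ind)|(ind.*def)|(def.*ind)|(abc.*def)|(def.*abc)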
