Shell Command 'join' not working - linux

I am trying to join two simple sorted files, but for some strange reason it's not working.
f1.txt:
f1 abc
f2 mno
f3 pqr
f2.txt:
abc a1
mno a2
pqr a3
Command:
join -t '\t' f1.txt f2.txt -1 2 -2 1 > f3.txt
FYI, in f1.txt and f2.txt the space shown is actually a tab.
I don't know why this is not working; f3.txt comes out empty.
Any insights would be appreciated.

Using join on the 2nd column of the 1st file and the 1st column of the 2nd file:
$ join -1 2 -2 1 file1 file2 > file3
$ cat file3
abc f1 a1
mno f2 a2
pqr f3 a3
Also, join by default delimits on blanks (spaces and tabs). The man page of join says the following about the -t flag:
-t CHAR
use CHAR as input and output field separator.
Unless -t CHAR is given, leading blanks separate fields and are ignored.
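If you do want an explicit tab separator in the output, note that '\t' in single quotes is just the two characters backslash and t, not a tab, which is why the command in the question fails. In bash you can pass a real tab with ANSI-C quoting; a minimal sketch, assuming the files really are tab-delimited and sorted on their join fields:
$ join -t $'\t' -1 2 -2 1 f1.txt f2.txt > f3.txt
This writes abc, f1, a1 (and so on) to f3.txt, with the columns separated by real tab characters.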

Related

How to merge two CSV files with Linux column wise? [duplicate]

This question already has answers here:
combining columns of 2 files using shell script
Are shell scripts sensitive to encoding and line endings?
I am looking for a simple one-liner (if possible) to merge two files column-wise and save the result into a new file.
Edited in response to the first answer by @heitor:
Using paste file1.csv file2.csv, here is what happened.
For instance, file1:
A B
1 2
file2:
C D
3 4
Running paste -d , file1.csv file2.csv > output.csv I got:
A B
C D
1 2
3 4
not
A B C D
1 2 3 4
Running cat file1.csv file2.csv I got:
A B
1 2
C D
3 4
Neither of them is what I want. Any ideas?
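One possible cause of paste appearing to stack the rows like this is Windows-style CRLF line endings in one of the files (one of the linked duplicates above is about exactly that); the stray carriage return breaks the pasted line when it is displayed. A quick check and fix, sketched here with GNU tools and illustrative output file names:
$ cat -A file1.csv            # CRLF endings show up as ^M$ at the end of each line
$ tr -d '\r' < file1.csv > file1.unix.csv
$ tr -d '\r' < file2.csv > file2.unix.csv
$ paste -d , file1.unix.csv file2.unix.csv > output.csv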
Use paste -d , to merge the two files and > to redirect the command output to another file:
$ paste -d , file1.csv file2.csv > output.csv
E.g.:
$ cat file1.csv
A,B
$ cat file2.csv
C,D
$ paste -d , file1.csv file2.csv > output.csv
$ cat output.csv
A,B,C,D
-d , tells paste to use , as the delimiter to join the columns.
> tells the shell to write the output of the paste command to the file output.csv
Indeed, using paste is pretty simple:
$ cat file1.csv
A B
1 2
$ cat file2.csv
C D
3 4
$ paste -d " " file1.csv file2.csv
A B C D
1 2 3 4
With the -d option I replaced the default tab character with a space.
Edit:
In case you want to redirect that to another file:
$ paste -d " " file1.csv file2.csv > file3.csv
$ cat file3.csv
A B C D
1 2 3 4

diff 2 files with an output that does not include extra lines

I have 2 files, test and test1, and I would like to diff them without the output containing the extra lines 2a3, 4a6, 6a9 shown below.
test:
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
When I diff both files:
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as
$ diff test test1
blueberries
blackberries
star fruit
A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
As we only need the lines unique to the second file, test1, -13 is used to suppress the unwanted columns.
Process substitution is used to feed the sorted files to comm.
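To see the three columns concretely, here is a tiny throwaway example (the file names left and right are purely illustrative; column 2 is indented by one tab, column 3 by two):
$ printf 'a\nb\nc\n' > left
$ printf 'b\nc\nd\n' > right
$ comm left right
a
		b
		c
	d
$ comm -13 left right
d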
You can use grep to keep only the changed lines and filter out the 2a3-style markers:
$ diff file1 file2 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction indicators that show which file differs, use sed:
$ diff file1 file2 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing to not see which file differs...)
You can use awk:
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR is true while the first file on the command line (i.e. test) is being processed,
a[$0] stores each record as a key in the array named a,
next means read the next line without doing anything else,
!($0 in a) means: if the current line does not exist in a, print it.
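Run on the sample test and test1 files above, this prints only the lines added in test1, in the order they appear there:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
blueberries
blackberries
star fruit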

Comparing two files using Awk

I have two text files: one with a list of IDs and another with some IDs and corresponding values.
File 1
abc
abcd
def
cab
kac
File 2
abcd 100
def 200
cab 500
kan 400
So, I want to compare both files and fetch the value for the matching IDs, while keeping all the IDs from File 1 and assigning "NA" to the IDs that don't have a value in File 2.
Desired output
abc NA
abcd 100
def 200
cab 500
kac NA
PS: awk scripts/one-liners only, please.
The code I'm using to print the matching columns:
awk 'FNR==NR{a[$1]++;next}a[$1]{print $1,"\t",$2}'
$ awk 'NR==FNR{a[$1]=$2;next} {print $1, ($1 in a? a[$1]: "NA") }' file2 file1
abc NA
abcd 100
def 200
cab 500
kac NA
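The same program spelled out with comments (just reformatted, identical behaviour):
awk '
    NR==FNR { a[$1] = $2; next }              # first file (file2): remember the value for each id
    { print $1, ($1 in a ? a[$1] : "NA") }    # second file (file1): print the id and its value, or NA
' file2 file1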
Using join and sort (hopefully portable):
export LC_ALL=C
sort -k1 file1 > /tmp/sorted1
sort -k1 file2 > /tmp/sorted2
join -a 1 -e NA -o 0,2.2 /tmp/sorted1 /tmp/sorted2
In bash you can use process substitution to do the same in a single line:
LC_ALL=C join -a 1 -e NA -o 0,2.2 <(LC_ALL=C sort -k1 file1) <(LC_ALL=C sort -k1 file2)
Note 1: this gives output sorted by the 1st column (see the sketch after Note 2 for getting back file1's original order):
abc NA
abcd 100
cab 500
def 200
kac NA
Note 2: the commands may work even without LC_ALL=C. What matters is that all the sort and join commands use the same locale.
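If you need the result back in file1's original order (matching the desired output above), one possible sketch is to tag file1's lines with their line numbers, carry the number through the join, and sort on it afterwards. The use of nl and cut here is an assumption layered on top of the original answer, not part of it:
LC_ALL=C join -1 2 -2 1 -a 1 -e NA -o 1.1,0,2.2 \
    <(nl -ba -w1 -s' ' file1 | LC_ALL=C sort -k2) \
    <(LC_ALL=C sort -k1 file2) |
    sort -n -k1 | cut -d' ' -f2-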

4 lines invert grep search in a directory that contains many files

I have many log files in a directory. In those files there are many lines, and some of these lines contain the word ERROR.
I am using grep ERROR abc* to get the error lines from all of the abc1, abc2, abc3, etc. files.
Now, there are 4-5 ERROR lines that I want to avoid.
So, I am using
grep ERROR abc* | grep -v 'str1\| str2'
This works fine. But when I add one more string,
grep ERROR abc* | grep -v 'str1\| str2\| str3
the output doesn't change.
I need to exclude 4-5 strings. Can anybody suggest a solution?
You are using multiple search patterns, i.e. in effect a regular expression with alternation. grep's -E option supports extended regular expressions, as you can see from the man page below:
-e PATTERN, --regexp=PATTERN
Use PATTERN as the pattern. This can be used to specify multiple search patterns, or to protect a pattern beginning with a hyphen (-). (-e is specified by POSIX.)
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
So you can use the -E flag along with -v to invert the match:
grep ERROR abc* | grep -Ev 'str1|str2|str3|str4|str5'
An example of the usage for your reference:
$ cat sample.txt
ID F1 F2 F3 F4 ID F1 F2 F3 F4
aa aa
bb 1 2 3 4 bb 1 2 3 4
cc 1 2 3 4 cc 1 2 3 4
dd 1 2 3 4 dd 1 2 3 4
xx xx
$ grep -vE "aa|xx|yy|F2|cc|dd" sample.txt
bb 1 2 3 4 bb 1 2 3 4
Your example should work, but you can also use
grep ERROR abc* | grep -e 'str1' -e 'str2' -e 'str3' -v
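For what it's worth, the command in the question also has two small problems: it is missing its closing quote, and the space after each \| becomes part of the pattern (so it matches ' str2' with a leading space rather than 'str2'). With those fixed, the BRE form works as well, at least with GNU grep, where \| is alternation:
grep ERROR abc* | grep -v 'str1\|str2\|str3\|str4\|str5'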

linux diff by ignoring unnecessary lines

The diff -I option doesn't work for me when there is a mismatch before the skipped lines.
File1:
a1
* b
File2:
a2
* c
$ diff -I '*' File1 File2
< a1
< * b
> a2
> * c
But if the first line is "a1" in both files, the output is clean.
Are there any suggestions for how to skip those lines even when there is a mismatch before them?
Thanks.
The behaviour that you're observing can be well explained by this comment. In short, -I only suppresses a hunk when every inserted or deleted line in that hunk matches the regular expression; a non-matching change (a1 vs a2 here) causes the whole surrounding hunk, ignorable * lines included, to be printed.
To elaborate, if the input files were to read:
$ cat 1
a1
* b
$ cat 2
a2
* c
then diff with -I would give you the expected output:
$ diff -I$'*' 1 2
1c1
< a1
---
> a2
In your case, you might use alternatives such as:
$ diff <(sed '/^\*/d' 1) <(sed '/^\*/d' 2)
1c1
< a1
---
> a2
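Equivalently, if you prefer grep for the filtering, the same idea can be sketched with the question's File1 and File2:
$ diff <(grep -v '^\*' File1) <(grep -v '^\*' File2)
1c1
< a1
---
> a2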
