Join two files in Linux

I am trying to join two files but they don't have the same number of lines. I need to join them by the second column.
File1:
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
File2:
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
The output for paired lines should look like:
San Noor#Steve White
The output for unpairable lines should look like:
Sara Moheeb#NA
(File3 after joining should contain 5 lines and look as follows.)
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA
I have tried to join these two files using this command:
join -t '#' -j2 -e "NA" <(sort -t '#' -k2 File1) <(sort -t '#' -k2 File2) > File3
It says that both files are not sorted. Also, I need a way to fill in missing values after the join.

Extract relevant columns and paste them together.
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2)
Well, but this will fail the NA requirement when one file has fewer lines than the other: paste leaves the missing field empty. You could pipe it through something along the lines of awk -F'#' -v OFS='#' '{ for (i = 1; i <= NF; ++i) if ($i == "") $i = "NA" } 1' to substitute the string NA for empty fields.
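Putting those two pieces together, here is a self-contained sketch on the question's sample data (file names file1 and file2 assumed, recreated inline):

```shell
# Recreate the question's sample files.
cat > file1 <<'EOF'
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
EOF
cat > file2 <<'EOF'
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
EOF

# paste pads the shorter file with empty fields; awk rewrites them as NA.
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2) |
  awk -F'#' -v OFS='#' '{ for (i = 1; i <= NF; ++i) if ($i == "") $i = "NA" } 1' > File3
cat File3
```

This prints the five expected lines, ending with Sara Moheeb#NA.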
So I guess your method is a possible one, but there is nothing in the files to "join" on. So join on an imaginary column of line numbers:
join -t'#' -eNA -a1 -a2 -o1.2,2.2 <(cut -d'#' -f2 file1 | nl -w1 -s'#') <(cut -d'#' -f2 file2 | nl -w1 -s'#')
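End to end, with the sample files recreated inline, that looks like this. (One caveat: nl's unpadded numbers sort lexicographically, so past 9 lines you would want zero-padded numbers on both sides, e.g. nl -nrz.)

```shell
# Recreate the question's sample files.
cat > file1 <<'EOF'
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
EOF
cat > file2 <<'EOF'
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
EOF

# Number each name, then join on that synthetic line-number key.
# -a1 -a2 keep unpairable lines; -e NA (together with -o) fills the missing side.
join -t'#' -e NA -a1 -a2 -o 1.2,2.2 \
  <(cut -d'#' -f2 file1 | nl -w1 -s'#') \
  <(cut -d'#' -f2 file2 | nl -w1 -s'#') > File3
cat File3
```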


How to print only words that don't match between two files? [duplicate]

FILE1:
cat
dog
house
tree
FILE2:
dog
cat
tree
I need only this to be printed:
house
$ cat file1
cat
dog
house
tree
$ cat file2
dog
cat
tree
$ grep -vF -f file2 file1
house
The -v flag only shows non-matches, -f is for a filename to use as a filter, and -F is for exact matches (doesn't slow it down with any pattern matching).
Using awk
awk 'FNR==NR{arr[$0]=1; next} !($0 in arr)' FILE2 FILE1
First build an associative array with the words from FILE2, then loop over FILE1 and print only the lines that are not in the array.
Using comm
comm -2 -3 <(sort FILE1) <(sort FILE2)
-2 suppresses lines unique to FILE2 and -3 suppresses lines found in both.
If you want just the words, you can sort the files, diff them, then use sed to filter out diff's symbols:
diff <(sort file1) <(sort file2) | sed -n '/^</s/^< //p'
Awk is an option here:
awk 'NR==FNR { arr[$1]="1" } NR != FNR { if (arr[$1] == "") { print $0 } } ' file2 file1
Create an array called arr, using the contents of file2 as indexes. Then with file1, look at each entry and check to see if an entry in the array arr exists. If it doesn't, print.

comparing files by fields in bash

I have two arbitrary files:
==> file1 <==
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 eggplant
==> file2 <==
11110 abcdefg
11111 apple-pie
11112 banana-cake
11113 chocolate
11115 egg
11116 fruit
For the sake of comparison of these files, I only care about the number in the first column, the words after the break are unimportant.
I want to be able to readily identify numbers that are missing from each file.
For example, file 1 has no 11116 and file 2 has no 11114.
If I sort the files together I can get a complete list:
$ sort file*
11110 abcdef
11110 abcdefg
11111 apple
11111 apple-pie
11112 banana
11112 banana-cake
11113 carrot
11113 chocolate
11114 date
11115 egg
11115 eggplant
11116 fruit
I can get a list of all the numbers by running it through uniq and only comparing the length of the number:
$ sort file* | uniq -w5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11114 date
11115 egg
11116 fruit
That's a list of all numbers 11110-11116.
I can get a list of uniques and duplicates by asking uniq to filter those for me:
duplicates (numbers that appear in both files):
$ sort file* | uniq -dw5
11110 abcdef
11111 apple
11112 banana
11113 carrot
11115 egg
unique numbers, or numbers that only appear in one file:
$ sort file* | uniq -uw5
11114 date
11116 fruit
I would like something that has output resembling:
# shows numbers that do not exist in this file
$ sort file* | <is missing>
==> file1 <==
11116 fruit
==> file2 <==
11114 date
It could do the reverse and show what numbers are missing from the OTHER file, each case is workable:
# shows numbers that do exist ONLY in this file
$ sort file* | <has unique>
==> file1 <==
11114 date
==> file2 <==
11116 fruit
The first field will contain ~30 alphanumeric characters.
The files in question contain thousands of entries and the majority of entries are expected to be in both files.
The arbitrary data to the right of the number is relevant and needs to remain.
I had the idea of:
generate a complete list of numbers
compare that list with file1 searching for unique entries
compare that list with file2 searching for unique entries
But I can't work out how to do that on a single line:
sort file* | uniq -w5 | sort file1 | uniq -uw5
sort file* | uniq -w5 | sort file2 | uniq -uw5
However, the output of the first uniq doesn't get merged in with the resorting of file1/2...
The solution I came up with was to create the output of all the numbers:
$ sort file* | uniq -w5
and then run that against each file individually, that does work. I just couldn't piece it together on one line:
$ sort all file1 | uniq -uw5
11116 fruit
$ sort all file2 | uniq -uw5
11114 date
I am now working on incorporating join, thanks Kamil
edit: I never got to go any further myself; @Shawn gave it to me in one very short line:
join -j1 -v1 file1 file2
After I have two compiled lists in the format I require, a join performed on the files spits out the required answer. From my code examples above:
$ join -j1 -v1 file1 file2
11114 date
$ join -j1 -v2 file1 file2
11116 fruit
A real world Example:
I thought I would generate a real world example of what I have been working on. Take 5 arbitrary files:
lorem1.txt
lorem2.txt
lorem3.txt
lorem4.txt
lorem5.txt
and make a backup of them. I have modified one bit in lorem2.txt and I removed lorem4.txt from the backup (consider it a new file, or for whatever reason, it is just a missing file):
test$ tree
.
├── data
│   ├── lorem1.txt
│   ├── lorem2.txt
│   ├── lorem3.txt
│   ├── lorem4.txt
│   └── lorem5.txt
└── data-backup
    ├── lorem1.txt
    ├── lorem2.txt
    ├── lorem3.txt
    └── lorem5.txt
2 directories, 9 files
mad#test$ md5deep data/* | sort > hash1
mad#test$ md5deep data-backup/* | sort > hash2
mad#test$ head hash*
==> hash1 <==
44da5caec444b6f00721f499e97c857a /test/data/lorem1.txt
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
a00edd450c533091e0f62a06902545a4 /test/data/lorem5.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data/lorem3.txt
==> hash2 <==
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
44da5caec444b6f00721f499e97c857a /test/data-backup/lorem1.txt
a00edd450c533091e0f62a06902545a4 /test/data-backup/lorem5.txt
fb8f7f39344394c78ab02d2ac524df9d /test/data-backup/lorem3.txt
Running our joins:
join 1
mad#test$ join -j1 -v1 hash*
5ba24c9a5f6d74f81499872877a5061d /test/data/lorem2.txt
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
From our two sets of hash files, joining them and printing the lines of the first file that have no match, we see that the hashes of lorem2.txt and lorem4.txt are missing from the second file (lorem2.txt because we changed a bit, and lorem4.txt because we didn't copy it, or we deleted the file from the backup).
Doing the reverse join we can see lorem2 exists, it's just that the hash is incorrect:
join 2
mad#test$ join -j1 -v2 hash*
000e755b8e840e42d50ef1ba5c7ae45d /test/data-backup/lorem2.txt
Using my sort and uniq examples from earlier, I could get similar results, but the joins above are much better: join 1 shows us files we need to revisit, join 2 specifically shows us which hashes are incorrect.
Sorting by name and showing unique names (which was way outside the scope of the original question) can show us files that are missing from the backup. In this example, I rewrite the backup filenames so they mimic the original filenames, merge/sort them with the original hashes, and compare based only on the names, not the hashes. This will show files that are missing from the backup:
test$ sort -k2 hash1 <(sed 's/data-backup/data/g' hash2) | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
If we had a file that contained all the hashes:
test$ sort -k2 hash1 allhashes | uniq -uf1
b80118923d16f649dd5410d54e5acb2d /test/data/lorem4.txt
Thanks again to everyone who helped me formulate this. It has turned into a real life and time saver.
Using GNU awk, you can make use of this approach:
awk 'ARGIND < ARGC-1 {
a[ARGIND][$1] = 1
next
} {
for (i=1; i<ARGC-1; i++)
if (!a[i][$1])
print ARGV[i] ":", $0
}' file1 file2 <(sort file1 file2)
file2: 11114 date
file1: 11116 fruit
Only in file1:
grep `comm -23 <(cut -d \ -f 1 f1 | sort) <(cut -d \ -f 1 f2 | sort)` f1
This awk version only takes one pass through each file:
It assumes that there are no duplicate IDs in a file.
awk '
NR == FNR {f1[$1] = $0; next}
!($1 in f1) {printf "only in %s: %s\n", FILENAME, $0}
$1 in f1 {delete f1[$1]}
END {for (id in f1) printf "only in %s: %s\n", ARGV[1], f1[id]}
' file1 file2
outputs:
only in file2: 11116 fruit
only in file1: 11114 date
You can use diff between the 2 files. However, if you diff these files, all the lines will be listed.
$ diff file1 file2
1,6c1,6
< 11110 abcdef
< 11111 apple
< 11112 banana
< 11113 carrot
< 11114 date
< 11115 eggplant
---
> 11110 abcdefg
> 11111 apple-pie
> 11112 banana-cake
> 11113 chocolate
> 11115 egg
> 11116 fruit
But you only care about the leading numbers.
$ diff <(cut -d' ' -f1 file1) <(cut -d' ' -f1 file2)
5d4
< 11114
6a6
> 11116
If the files are not sorted then add a sort
$ diff <(cut -d' ' -f1 file1 | sort) <(cut -d' ' -f1 file2 | sort)
5d4
< 11114
6a6
> 11116

Bash: join by numeric column

If I want to use join on my Ubuntu, I need to first sort both files lexicographically (according to join --help), and only then join them:
tail -n +2 meta/201508_1 | sort -k 1b,1 > meta.txt
tail -n +2 keywords/copy | sort -k 1b,1 > keywords.txt
join meta.txt keywords.txt -1 1 -2 1 -t $'\t'
(I also remove the header from both of them using tail)
But instead of sorting files lexicographically, I would like to sort them numerically: the first column in both files is an ID.
tail -n +2 meta/201508_1 | sort -k1 -n > meta.txt
tail -n +2 keywords/copy.txt | sort -k1 -n > keywords.txt
And then join. But for join these files look unsorted:
join: meta.txt:10: is not sorted: 1023 301000 en
join: keywords.txt:2: is not sorted: 10 keyword1
If I add --nocheck-order to join, it doesn't join properly - it outputs just one line.
How do I join two files on their numerical ID in bash?
Sample (columns are tab-separated):
file 1
id volume lang
1 10 en
2 20 en
5 30 en
6 40 en
10 50 en
file 2
id keyword
4 kw1
2 kw2
10 kw3
1 kw4
3 kw5
desired output
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
Both of these work. The first one uses a plain sort -b (recommended on the Mac):
join <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
The Linux man page recommends sort -k 1b,1:
join <(sed 1d file1 | sort -k 1b,1) <(sed 1d file2 | sort -k 1b,1) | sort -n
In any case, you need to sort them lexicographically to join them. At the end you can still sort the result numerically.
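Here is a complete run on the sample data, recreated inline. One detail worth noting: listing file2 first makes the keyword the second output column, which is what the desired output shows; also, join's output is space-separated by default even though the inputs are tab-separated.

```shell
# Recreate the tab-separated sample files (headers included).
printf 'id\tvolume\tlang\n1\t10\ten\n2\t20\ten\n5\t30\ten\n6\t40\ten\n10\t50\ten\n' > file1
printf 'id\tkeyword\n4\tkw1\n2\tkw2\n10\tkw3\n1\tkw4\n3\tkw5\n' > file2

# Drop headers, sort lexicographically for join, then restore numeric order.
join <(sed 1d file2 | sort -k 1b,1) <(sed 1d file1 | sort -k 1b,1) | sort -n
```

This prints:
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en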
You can ditch join and use awk instead:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{print $1, a[$1], $2, $3}' file2 file1 | column -t
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
It is probably already in the order that you want (as per the ID column in file1). However if you need specific sorting you can do:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{
print $1, a[$1], $2, $3}' file2 file1 | sort -nk1 | column -t
Note that column -t is there to produce tabular formatted output.

How to extract some missing rows by comparing two different files in linux?

I have two different files, and some rows are missing in one of them. I want to make a new file including those non-common rows between the two files. As an example, I have the following files:
file1:
id1
id22
id3
id4
id43
id100
id433
file2:
id1
id2
id22
id3
id4
id8
id43
id100
id433
id21
I want to extract those rows which exist in file2 but not in file1:
new file:
id2
id8
id21
any suggestion please?
Use the comm utility (assumes bash as the shell):
comm -13 <(sort file1) <(sort file2)
Note how the input must be sorted for this to work, so your delta will be sorted, too.
comm uses an (interleaved) 3-column layout:
column 1: lines only in file1
column 2: lines only in file2
column 3: lines in both files
-13 suppresses columns 1 and 3, which prints only the values exclusive to file2.
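For a self-contained check, here is the whole thing run against the sample ids (recreated here without the stray trailing whitespace discussed below):

```shell
# Recreate the sample id lists.
printf 'id1\nid22\nid3\nid4\nid43\nid100\nid433\n' > file1
printf 'id1\nid2\nid22\nid3\nid4\nid8\nid43\nid100\nid433\nid21\n' > file2

# Lines exclusive to file2.
comm -13 <(sort file1) <(sort file2)
```

which prints id2, id21 and id8, one per line (in sorted order, not file2's original order).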
Caveat: For lines to be recognized as common to both files they must match exactly - seemingly identical lines that differ in terms of whitespace (as is the case in the sample data in the question as of this writing, where file1 lines have a trailing space) will not match.
cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.
For instance, cat -et file1 would output lines such as id1 $, making it obvious that there's a trailing space at the end of the line (represented as $).
If instead of cleaning up file1 you want to compare the files as-is, try:
comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)
A generalized solution that trims leading and trailing whitespace from the lines of both files:
comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
<(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)
Note: The above sed commands require either GNU or BSD sed.
You can sort both files together, count the duplicate rows, and select only those rows where the count is 1:
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'

Find value from one csv in another one (like vlookup) in bash (Linux)

I have already tried all options that I found online to solve my issue but without good result.
Basically I have two csv files (pipe separated):
file1.csv:
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
file2.csv:
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
I need a Linux bash script to find the value of pos. 3 in file2 based on the content of pos. 7 in file1.
Example:
file1, line1, pos 7: MAYOBAN
find MAYOBAN in file2, return pos 3 (2400)
the output should be something like this:
2400
2200
2200
etc...
Please help
Jacek
A simple approach, far from perfect:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv );
do
grep "${i}" file2.csv | cut -f 3 -d "${DELIMITER}";
done
This will work, but since the input files must be sorted, the output order will be affected:
join -t '|' -1 7 -2 1 -o 2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output would look like:
2200
2200
2400
which is useless. In order to have a useful output, include the key value:
join -t '|' -1 7 -2 1 -o 0,2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output then looks like this:
CORKCOR|2200
CORKKIN|2200
MAYOBAN|2400
Edit:
Here's an AWK version:
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
This loops through file1.csv and creates array entries for each value of field 7. Simply referring to an array element creates it (with a null value). FNR is the record number in the current file and NR is the record number across all files. When they're equal, the first file is being processed. The next instruction reads the next record, creating a loop. When FNR == NR is no longer true, the subsequent file(s) are processed.
So file2.csv is now processed and if it has a field 1 that exists in the array, then its field 3 is printed.
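A self-contained run against the question's sample files; note that the output order follows file2.csv, which here happens to match file1.csv's order as well:

```shell
# Recreate the question's sample files.
cat > file1.csv <<'EOF'
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
EOF
cat > file2.csv <<'EOF'
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
EOF

# Collect file1's field 7 as keys, then print field 3 of matching file2 rows.
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
```

This prints 2400, 2200, 2200, one per line.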
You can use Miller (https://github.com/johnkerl/miller).
Starting from input01.txt
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
and input02.txt
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
and running
mlr --csv -N --ifs "|" join -j 7 -l 7 -r 1 -f input01.txt then cut -f 3 input02.txt
you will have
2400
2200
2200
Some notes:
-N to set input and output without header;
--ifs "|" to set the input fields separator;
-l 7 -r 1 to set the join fields of the input files;
cut -f 3 to extract the field named 3 from the join output
cut -d\| -f7 file1.csv | while read -r line
do
    grep "^${line}|" file2.csv | cut -d\| -f3
done
