Comparing two files using Awk - Linux

I have two text files: one with a list of IDs, and another with IDs and their corresponding values.
File 1
abc
abcd
def
cab
kac
File 2
abcd 100
def 200
cab 500
kan 400
So, I want to compare the two files, fetch the value for each matching ID, keep every ID from File 1, and assign "NA" to the IDs that have no value in File 2.
Desired output
abc NA
abcd 100
def 200
cab 500
kac NA
PS: Awk scripts/one-liners only, please.
The code I'm using to print matching columns:
awk 'FNR==NR{a[$1]++;next}a[$1]{print $1,"\t",$2}'

$ awk 'NR==FNR{a[$1]=$2;next} {print $1, ($1 in a? a[$1]: "NA") }' file2 file1
abc NA
abcd 100
def 200
cab 500
kac NA
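For reference, here is the same one-liner split out with comments (functionally identical):
awk '
    NR == FNR { a[$1] = $2; next }             # first file (file2): map id -> value
    { print $1, ($1 in a ? a[$1] : "NA") }     # second file (file1): print the value, or NA
' file2 file1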

Using join and sort (hopefully portable):
export LC_ALL=C
sort -k1 file1 > /tmp/sorted1
sort -k1 file2 > /tmp/sorted2
join -a 1 -e NA -o 0,2.2 /tmp/sorted1 /tmp/sorted2
In bash you can use process substitution to do it in a single line:
LC_ALL=C join -a 1 -e NA -o 0,2.2 <(LC_ALL=C sort -k1 file1) <(LC_ALL=C sort -k1 file2)
Note 1: this gives the output sorted by the 1st column:
abc NA
abcd 100
cab 500
def 200
kac NA
Note 2: the commands may work even without LC_ALL=C; what matters is that all the sort and join commands use the same locale.
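If you need the output in file1's original order rather than sorted, a possible variation (a sketch, not tested) is to number file1's lines with nl, carry the number through the join, and sort it back out at the end:
export LC_ALL=C
join -1 2 -a 1 -e NA -o 1.1,0,2.2 \
    <(nl -ba -w1 -s' ' file1 | sort -k2,2) \
    <(sort -k1,1 file2) |
sort -k1,1n | cut -d' ' -f2-
Here nl prepends a line number, the join runs on the id (field 2 of the numbered file), and the final sort -n / cut restore the original order and drop the helper column.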

Related

Join two files in Linux

I am trying to join two files but they don't have the same number of lines. I need to join them by the second column.
File1:
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
File2:
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
The output for paired lines should look like:
San Noor#Steve White
The output for unpairable lines should look like:
Sara Moheeb#NA
(File3 after joining should contain 5 lines and look as follows.)
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA
I have tried to join these two files using this command:
join -t '#' -j2 -e "NA" <(sort -t '#' -k2 File1) <(sort -t '#' -k2 File2) > File3
It says that both files are not sorted. Also, I need a way to fill in missing values after join.
Extract relevant columns and paste them together.
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2)
Well, this will fail in the NA case, when one file has fewer lines than the other. You could pipe it to something along the lines of awk -F'#' -v OFS='#' '{ for (i=1;i<=NF;++i) if ($i == "") $i = "NA" } 1' to replace empty fields with the string NA.
So I guess your method is a possible one, but the files have nothing to "join" on. Instead, join on an imaginary column of line numbers:
join -t'#' -eNA -a1 -a2 -o1.2,2.2 <(cut -d'#' -f2 file1 | nl -w1 -s'#') <(cut -d'#' -f2 file2 | nl -w1 -s'#')
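To see the synthetic key, run one side of the pipeline on its own; with the sample file1 it produces:
$ cut -d'#' -f2 file1 | nl -w1 -s'#'
1#San Noor
2#Maria Shiry
3#John Smith
4#Viki Sam
5#Sara Moheeb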

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that share the same first two columns.
For example, the data looks like this:
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And the result I want is:
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to work, but since I am learning to code better, I was wondering if there is a better way to go about this. Also, my approach generates many intermediate files that I then have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items needed to understand the awk program: associative arrays and fields.
If you reference a nonexistent awk array element, it is created empty -- ready for anything you put into it. That makes it usable as a counter.
You state "If first two columns are equal..."
The sort puts the file into the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state "...select top 3 based on descending order of 3rd column..."
Once again, the sort puts the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
awk programs are organized into blocks of condition { action }. The condition a[$1,$2]++<3 is true for the first three occurrences of each key: the post-increment returns the old count, so 0, 1 and 2 pass the test, and from the fourth occurrence on it fails.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
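To watch the counter at work, you could print it alongside each line (a hypothetical debugging variant of the same idea):
sort -k1,1 -k2,2 -k3,3nr file | awk '{print a[$1,$2]++, $0}'
Each line is prefixed with how many rows of its group have already been seen, which is exactly the value the <3 test compares.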
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold D. Robbins.
@dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
    {key = $1 FS $2}                        # group key: the first two fields
    prev != key {prev = key; count = 1}     # new group: reset the counter
    count <= 3 {print; count++}             # print at most three lines per group
'
You can sort the file primarily by the first two columns and secondarily, numerically and in descending order, by the 3rd column, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))                # same group as the previous line: bump the counter
    else
        c=0                    # new group: reset the counter
    fi
    if (( c < 3 )) ; then      # print only the first three lines of each group
        echo "$c1 $c2 $n"
        l1=$c1                 # remember the current group key
        l2=$c2
    fi
done

Bash: join by numeric column

If I want to use join on my Ubuntu, I need to first sort both files lexicographically (according to join --help), and only then join them:
tail -n +2 meta/201508_1 | sort -k 1b,1 > meta.txt
tail -n +2 keywords/copy | sort -k 1b,1 > keywords.txt
join meta.txt keywords.txt -1 1 -2 1 -t $'\t'
(I also remove the header from both of them using tail)
But instead of sorting files lexicographically, I would like to sort them numerically: the first column in both files is an ID.
tail -n +2 meta/201508_1 | sort -k1 -n > meta.txt
tail -n +2 keywords/copy.txt | sort -k1 -n > keywords.txt
And then join. But join considers these files unsorted:
join: meta.txt:10: is not sorted: 1023 301000 en
join: keywords.txt:2: is not sorted: 10 keyword1
If I add --nocheck-order to join, it doesn't join properly - it outputs just one line.
How do I join two files on their numerical ID in bash?
Sample (columns are tab-separated):
file 1
id volume lang
1 10 en
2 20 en
5 30 en
6 40 en
10 50 en
file 2
id keyword
4 kw1
2 kw2
10 kw3
1 kw4
3 kw5
desired output
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
Both of these work. The first one uses sort -b (recommended on macOS):
join <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
The second follows the Linux man page's recommendation of sort -k 1b,1:
join <(sed 1d file1 | sort -k 1b,1) <(sed 1d file2 | sort -k 1b,1) | sort -n
In any case, you need to sort them lexicographically to join them. At the end you can still sort the result numerically.
You can ditch join and use awk instead:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{print $1, a[$1], $2, $3}' file2 file1 | column -t
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
The output is probably already in the order you want (following the ID column of file1). However, if you need a specific sort order you can do:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{
print $1, a[$1], $2, $3}' file2 file1 | sort -nk1 | column -t
Note that column -t is there to produce tabular formatted output.
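For reference, here is the same awk program split out with comments (equivalent to the one-liner above):
awk -F'\t' '
    FNR == 1  { next }                        # skip the header line of each file
    NR == FNR { a[$1] = $2; next }            # first file (file2): map id -> keyword
    $1 in a   { print $1, a[$1], $2, $3 }     # second file (file1): print the joined row
' file2 file1 | column -t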

How to extract some missing rows by comparing two different files in Linux?

I have two different files, and some rows are missing from one of them. I want to make a new file containing the rows that are not common to both files. As an example, I have the following files:
file1:
id1
id22
id3
id4
id43
id100
id433
file2:
id1
id2
id22
id3
id4
id8
id43
id100
id433
id21
I want to extract the rows which exist in file2 but not in file1:
new file:
id2
id8
id21
Any suggestions, please?
Use the comm utility (assumes bash as the shell):
comm -13 <(sort file1) <(sort file2)
Note how the input must be sorted for this to work, so your delta will be sorted, too.
comm uses an (interleaved) 3-column layout:
column 1: lines only in file1
column 2: lines only in file2
column 3: lines in both files
-13 suppresses columns 1 and 2, which prints only the values exclusive to file2.
Caveat: For lines to be recognized as common to both files they must match exactly - seemingly identical lines that differ in terms of whitespace (as is the case in the sample data in the question as of this writing, where file1 lines have a trailing space) will not match.
cat -et is a command that visualizes line endings and control characters, which is helpful in diagnosing such problems.
For instance, cat -et file1 would output lines such as id1 $, making it obvious that there's a trailing space at the end of the line (represented as $).
If instead of cleaning up file1 you want to compare the files as-is, try:
comm -13 <(sed -E 's/ +$//' file1 | sort) <(sort file2)
A generalized solution that trims leading and trailing whitespace from the lines of both files:
comm -13 <(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file1 | sort) \
<(sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g' file2 | sort)
Note: The above sed commands require either GNU or BSD sed.
You can sort both files together, count duplicate rows, and select only the rows whose count is 1 (note this assumes every line of file1 also appears in file2, since a line unique to either file would be printed):
sort file1 file2 | uniq -c | awk '$1 == 1 {print $2}'
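An awk alternative (a sketch in the spirit of the other answers here) that needs no sorting and preserves file2's order:
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' file1 file2
It stores every line of file1 as an array key, then prints the lines of file2 that are not among those keys. The same exact-match caveat about trailing whitespace applies here, too.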

Find value from one csv in another one (like vlookup) in bash (Linux)

I have already tried all the options I found online to solve my issue, but without success.
Basically I have two csv files (pipe separated):
file1.csv:
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
file2.csv:
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
I need a Linux bash script to find the value of position 3 in file2 based on the content of position 7 in file1.
Example:
file1, line1, pos 7: MAYOBAN
find MAYOBAN in file2, return pos 3 (2400)
the output should be something like this:
**2400**
**2200**
**2200**
**etc...**
Please help
Jacek
A simple approach, far from perfect:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv );
do
grep "${i}" file2.csv | cut -f 3 -d "${DELIMITER}";
done
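Note that grep matches the key anywhere in a line of file2.csv, so a key that happens to appear inside another field could produce false hits. Anchoring the pattern to the start of the line (a small hardening of the same loop) avoids that:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv); do
    # match only at the start of the line, up to the first delimiter
    grep "^${i}${DELIMITER}" file2.csv | cut -f 3 -d "${DELIMITER}"
done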
This will work, but since the input files must be sorted, the output order will be affected:
join -t '|' -1 7 -2 1 -o 2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output would look like:
2200
2200
2400
which is useless. In order to have a useful output, include the key value:
join -t '|' -1 7 -2 1 -o 0,2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output then looks like this:
CORKCOR|2200
CORKKIN|2200
MAYOBAN|2400
Edit:
Here's an AWK version:
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
This loops through file1.csv and creates array entries for each value of field 7. Simply referring to an array element creates it (with a null value). FNR is the record number in the current file and NR is the record number across all files. When they're equal, the first file is being processed. The next instruction reads the next record, creating a loop. When FNR == NR is no longer true, the subsequent file(s) are processed.
So file2.csv is now processed and if it has a field 1 that exists in the array, then its field 3 is printed.
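The same program with the two phases commented (equivalent to the one-liner):
awk -F'|' '
    FNR == NR  { keys[$7]; next }   # first file: remember every field-7 key
    $1 in keys { print $3 }         # second file: print field 3 for matching keys
' file1.csv file2.csv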
You can use Miller (https://github.com/johnkerl/miller).
Starting from input01.txt
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
and input02.txt
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
and running
mlr --csv -N --ifs "|" join -j 7 -l 7 -r 1 -f input01.txt then cut -f 3 input02.txt
you will have
2400
2200
2200
Some notes:
-N to set input and output without header;
--ifs "|" to set the input fields separator;
-l 7 -r 1 to set the join fields of the input files;
cut -f 3 to extract the field named 3 from the join output
cut -d\| -f7 file1.csv | while read line
do
    grep "$line" file2.csv | cut -d\| -f3
done
