Bash: join by numeric column - linux

If I want to use join on my Ubuntu, I need to first sort both files lexicographically (according to join --help), and only then join them:
tail -n +2 meta/201508_1 | sort -k 1b,1 > meta.txt
tail -n +2 keywords/copy | sort -k 1b,1 > keywords.txt
join meta.txt keywords.txt -1 1 -2 1 -t $'\t'
(I also remove the header from both of them using tail)
But instead of sorting files lexicographically, I would like to sort them numerically: the first column in both files is an ID.
tail -n +2 meta/201508_1 | sort -k1 -n > meta.txt
tail -n +2 keywords/copy.txt | sort -k1 -n > keywords.txt
And then join. But for join these files look unsorted:
join: meta.txt:10: is not sorted: 1023 301000 en
join: keywords.txt:2: is not sorted: 10 keyword1
If I add --nocheck-order to join, it doesn't join properly - it outputs just one line.
How do I join two files on their numerical ID in bash?
Sample (columns are tab-separated):
file 1
id volume lang
1 10 en
2 20 en
5 30 en
6 40 en
10 50 en
file 2
id keyword
4 kw1
2 kw2
10 kw3
1 kw4
3 kw5
desired output
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en

Both of these work. The first one (sort -b is recommended on the Mac)
join <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
the Linux man page recommends sort -k 1b,1
join <(sed 1d file1 | sort -k 1b,1) <(sed 1d file2 | sort -k 1b,1) | sort -n
In any case, you need to sort them lexicographically to join them. At the end you can still sort the result numerically.

You can ditch join and use awk instead:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{print $1, a[$1], $2, $3}' file2 file1 | column -t
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
It is probably already in the order that you want (as per the ID column in file1). However if you need specific sorting you can do:
awk -F'\t' 'FNR==1{next} NR==FNR{a[$1]=$2; next} $1 in a{
print $1, a[$1], $2, $3}' file2 file1 | sort -nk1 | column -t
Note that column -t is there to produce tabular formatted output.

Related

Join two files linux

I am trying to join two files but they don't have the same number of lines. I need to join them by the second column.
File1:
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
File2:
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
The output should be:
The output for paired lines should look like:
San Noor#Sereve White
The output for unpairable lines should look like:
Sara Moheeb#NA
(The file3 after joining should contain 5 lines and look as followed.)
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA
I have tried to join these two files using this command:
join -t '#' -j2 -e "NA" <(sort -t '#' -k2 File1) <(sort -t '#' -k2 File2) > File3
It says that both files are not sorted. Also, I need a way to fill in missing values after join.
Extract relevant columns and paste them together.
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2)
Well, but this will fail for the NA case, when one file has less lines then the other. You could pipe it to something along awk -v OFS='#' -F'#' { for (i=1;i<NF;++i) if (length($i) == 0) $i="NA"; } to substitute empty fields for the string NA.
So I guess your method is a possible one, but you have nothing to "join" on the files. So join on an a imaginary column with line numbers:
join -t'#' -eNA -a1 -a2 -o1.2,2.2 <(cut -d'#' -f2 file1 | nl -w1 -s'#') <(cut -d'#' -f2 file2 | nl -w1 -s'#')

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select top 3 results for every line that has the same first two column.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 1
B A 1
B A 2
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand the awk program; associative arrays and fields.
If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
awk is organized into blocks of condition {action} The statement a[$1,$2]++<3 is true until there are more than 3 of the same pattern seen.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
Great place to start is the online book Effective AWK Programming by Arnold D. Robbins
#Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
((c++))
else
c=0
fi
if (( c < 3 )) ; then
echo $c1 $c2 $n
l1=$c1
l2=$c2
fi
done

how to use awk/sed to deal with these two files to get a result that I want

I want to use awk/sed to deal with two files(a.txt and b.txt) below and get the result
cat a.txt
a UK
b Japan
c China
d Korea
e US
And cat b.txt results
c Russia
e Canada
The result that I want is as below:
a UK
b Japan
c Russia
d Korea
e Canada
With awk:
First fill aray/hash a with complete row ($0) and use first column ($1) from this row as index. Finally, print all elements of array/hash a with a loop.
awk '{a[$1]=$0} END{for(i in a) print a[i]}' file1 file2
Output:
a UK
b Japan
c Russia
d Korea
e Canada
try:
awk 'FNR==NR{A[$1]=$NF;next} {printf("%s %s\n",$1,$1 in A?A[$1]:$NF)}' b.txt a.txt
Checking here condition FNR==NR which will be TRUE only when first file(b.txt) is being read. Then creating an array named A whose index is $1 and have the value last column. Then using printf for printing 2 strings where first string is $1 and another is if $1 of a.txt is present in array A then print array A's value whose index is $1 else print last column of a.tzt itself.
EDIT: as OP had carriage characters into Input_files so please remove them by following too.
tr -d '\r' < b.txt > temp_b.txt && mv temp_b.txt b.txt
You can use the below one-liner:
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt ) | sed 's/ --$//' | awk -F ' -- ' '{print $NF}'
We use awk to prefix each line in b.txt with a key and -- to give us a split point later:
<( awk '{print $1, "--", $0, "--"}' < b.txt )
Use the join command to join the files on common keys. The -a 1 option tells the command to
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt )
Use sed to remove the -- parts that are on some end of lines:
sed 's/ --$//'
Use awk to print the last item on each line:
awk -F ' -- ' '{print $NF}'
$ awk 'NR==FNR{b[$1]=$2;next} {print $1, ($1 in b ? b[$1] : $2)}' b.txt a.txt
a UK
b Japan
c Russia
d Korea
e Canada

How can I sort a 10GB file?

I'm trying to sort a big table stored in a file. The format of the file is
(ID, intValue)
The data is sorted by ID, but what I need is to sort the data using the intValue, in descending order.
For example
ID | IntValue
1 | 3
2 | 24
3 | 44
4 | 2
to this table
ID | IntValue
3 | 44
2 | 24
1 | 3
4 | 2
How can I use the Linux sort command to do the operation? Or do you recommend another way?
How can I use the Linux sort command to do the operation? Or do you recommend another way?
As others have already pointed out, see man sort for -k & -t command line options on how to sort by some specific element in the string.
Now, the sort also has facility to help sort huge files which potentially don't fit into the RAM. Namely the -m command line option, which allows to merge already sorted files into one. (See merge sort for the concept.) The overall process is fairly straight forward:
Split the big file into small chunks. Use for example the split tool with the -l option. E.g.:
split -l 1000000 huge-file small-chunk
Sort the smaller files. E.g.
for X in small-chunk*; do sort -t'|' -k2 -nr < $X > sorted-$X; done
Merge the sorted smaller files. E.g.
sort -t'|' -k2 -nr -m sorted-small-chunk* > sorted-huge-file
Clean-up: rm small-chunk* sorted-small-chunk*
The only thing you have to take special care about is the column header.
How about:
sort -t' ' -k2 -nr < test.txt
where test.txt
$ cat test.txt
1 3
2 24
3 44
4 2
gives sorting in descending order (option -r)
$ sort -t' ' -k2 -nr < test.txt
3 44
2 24
1 3
4 2
while this sorts in ascending order (without option -r)
$ sort -t' ' -k2 -n < test.txt
4 2
1 3
2 24
3 44
in case you have duplicates
$ cat test.txt
1 3
2 24
3 44
4 2
4 2
use the uniq command like this
$ sort -t' ' -k2 -n < test.txt | uniq
4 2
1 3
2 24
3 44

Find value from one csv in another one (like vlookup) in bash (Linux)

I have already tried all options that I found online to solve my issue but without good result.
Basically I have two csv files (pipe separated):
file1.csv:
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
file2.csv:
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
I need a linux bash script to find the value of pos.3 from file2 based on the content of pos7 in file1.
Example:
file1, line1, pos 7: MAYOBAN
find MAYOBAN in file2, return pos 3 (2400)
the output should be something like this:
**2400**
**2200**
**2200**
**etc...**
Please help
Jacek
A little approach, far away to be perfect:
DELIMITER="|"
for i in $(cut -f 7 -d "${DELIMITER}" file1.csv );
do
grep "${i}" file2.csv | cut -f 3 -d "${DELIMITER}";
done
This will work, but since the input files must be sorted, the output order will be affected:
join -t '|' -1 7 -2 1 -o 2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output would look like:
2200
2200
2400
which is useless. In order to have a useful output, include the key value:
join -t '|' -1 7 -2 1 -o 0,2.3 <(sort -t '|' -k7,7 file1.csv) <(sort -t '|' -k1,1 file2.csv)
The output then looks like this:
CORKCOR|2200
CORKKIN|2200
MAYOBAN|2400
Edit:
Here's an AWK version:
awk -F '|' 'FNR == NR {keys[$7]; next} {if ($1 in keys) print $3}' file1.csv file2.csv
This loops through file1.csv and creates array entries for each value of field 7. Simply referring to an array element creates it (with a null value). FNR is the record number in the current file and NR is the record number across all files. When they're equal, the first file is being processed. The next instruction reads the next record, creating a loop. When FNR == NR is no longer true, the subsequent file(s) are processed.
So file2.csv is now processed and if it has a field 1 that exists in the array, then its field 3 is printed.
You can use Miller (https://github.com/johnkerl/miller).
Starting from input01.txt
123|21|0452|IE|IE|1|MAYOBAN|BRIN|OFFICE|STREET|MAIN STREET|MAYOBAN|
123|21|0453|IE|IE|1|CORKKIN|ROBERT|SURNAME|CORK|APTS|CORKKIN|
123|21|0452|IE|IE|1|CORKCOR|NAME|HARRINGTON|DUBLIN|STREET|CORKCOR|
and input02.txt
MAYOBAN|BANGOR|2400
MAYOBEL|BELLAVARY|2400
CORKKIN|KINSALE|2200
CORKCOR|CORK|2200
DUBLD11|DUBLIN 11|2100
and running
mlr --csv -N --ifs "|" join -j 7 -l 7 -r 1 -f input01.txt then cut -f 3 input02.txt
you will have
2400
2200
2200
Some notes:
-N to set input and output without header;
--ifs "|" to set the input fields separator;
-l 7 -r 1 to set the join fields of the input files;
cut -f 3 to extract the field named 3 from the join output
cut -d\| -f7 file1.csv|while read line
do
grep $line file1.csv|cut -d\| -f3
done

Resources