unix join command to return all columns in one file - linux

I have two files that I am joining on one column. After the join, I just want the output to be all of the columns, in the original order, from only one of the files. For example:
cat file1.tsv
1 a ant
2 b bat
3 c cat
8 d dog
9 e eel
cat file2.tsv
1 I
2 II
3 III
4 IV
5 V
join -1 1 -2 1 file1.tsv file2.tsv -t $'\t' -o 1.1,1.2,1.3
1 a ant
2 b bat
3 c cat
I know I can use the -o 1.1,1.2,... notation, but my file has over two dozen columns. Is there some wildcard that I can use to say -o 1.* or something?

I'm not aware of wildcards in the format string.
From your desired output I think that what you want may be achievable like so without having to specify all the enumerations:
grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat
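Note that grep matches the keys anywhere in a line, not only in the first column, so this is safe only when a key cannot appear elsewhere in the line. An awk-only solution that compares just the first column: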
awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
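If the goal is to stay with join, one way around the missing wildcard is to generate the -o list from the column count of the first file. This is only a minimal sketch under assumptions: the sample files are recreated in a scratch directory, they are tab-separated and already sorted on the join column, and the names workdir, tab, fmt, and joined are illustrative.

```shell
# Recreate the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d)
printf '1\ta\tant\n2\tb\tbat\n3\tc\tcat\n8\td\tdog\n9\te\teel\n' > "$workdir/file1.tsv"
printf '1\tI\n2\tII\n3\tIII\n4\tIV\n5\tV\n' > "$workdir/file2.tsv"

# A literal tab, written portably for shells without $'\t'.
tab=$(printf '\t')

# Build "1.1,1.2,...,1.N" from the number of fields on the first line of file1.
fmt=$(awk -F'\t' 'NR == 1 {
    for (i = 1; i <= NF; i++) printf "%s1.%d", (i > 1 ? "," : ""), i
    print ""
    exit
}' "$workdir/file1.tsv")

# join expects both inputs sorted on the join field; the samples already are.
joined=$(join -t "$tab" -o "$fmt" "$workdir/file1.tsv" "$workdir/file2.tsv")
printf '%s\n' "$joined"
```

For the sample data, fmt ends up as 1.1,1.2,1.3, so the join call is equivalent to the one in the question without the list being typed by hand.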

Related

How to merge two CSV files with Linux column wise? [duplicate]

This question already has answers here:
combining columns of 2 files using shell script
Are shell scripts sensitive to encoding and line endings?
I am looking for a simple line of code (if possible) to simply merge two files column wise and save the final result into a new file.
Edited in response to the first answer by @heitor:
Using paste file1.csv file2.csv, this is what happened.
For instance file 1:
A B
1 2
file2:
C D
3 4
By doing paste -d , file1.csv file2.csv >output.csv I got
A B
C D
1 2
3 4
not
A B C D
1 2 3 4
By doing cat file1.csv file2.csv I got
A B
1 2
C D
3 4
Neither of them is what I want. Any idea?
Use paste -d , to merge the two files and > to redirect the command output to another file:
$ paste -d , file1.csv file2.csv > output.csv
E.g.:
$ cat file1.csv
A,B
$ cat file2.csv
C,D
$ paste -d , file1.csv file2.csv > output.csv
$ cat output.csv
A,B,C,D
-d , tells paste to use , as the delimiter to join the columns.
> tells the shell to write the output of the paste command to the file output.csv
Indeed using paste is pretty simple,
$ cat file1.csv
A B
1 2
$ cat file2.csv
C D
3 4
$ paste -d " " file1.csv file2.csv
A B C D
1 2 3 4
With the -d option I replaced the default tab character with a space.
Edit:
In case you want to redirect that to another file then,
$ paste -d " " file1.csv file2.csv > file3.csv
$ cat file3.csv
A B C D
1 2 3 4
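The layout reported in the question, where each file's line lands on its own row, is what paste output tends to look like when the inputs carry Windows-style CRLF line endings. The linked question on line endings points the same way. This is an assumption about the asker's files, not something the thread confirms. A minimal sketch that strips the carriage returns before pasting (workdir and the .unix.csv names are illustrative):

```shell
# Recreate the two sample files with CRLF line endings (assumed cause of the problem).
workdir=$(mktemp -d)
printf 'A B\r\n1 2\r\n' > "$workdir/file1.csv"
printf 'C D\r\n3 4\r\n' > "$workdir/file2.csv"

# Remove every carriage return so each line ends in a plain newline.
tr -d '\r' < "$workdir/file1.csv" > "$workdir/file1.unix.csv"
tr -d '\r' < "$workdir/file2.csv" > "$workdir/file2.unix.csv"

# Paste the cleaned files side by side, separated by a single space.
paste -d ' ' "$workdir/file1.unix.csv" "$workdir/file2.unix.csv" > "$workdir/output.csv"
merged=$(cat "$workdir/output.csv")
printf '%s\n' "$merged"
```

Running file on an input, or piping it through od -c, is a quick way to check whether carriage returns are present at all.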

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for every group of lines that share the same first two columns.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand in the awk program: associative arrays and fields.
If you reference an awk array element that does not exist yet, awk creates it empty, and an empty value counts as 0 in a numeric context. You can use that as a counter.
You state "If first two columns are equal..."
The sort puts the file in the desired order. The expression a[$1,$2] uses the values of the first two fields as a unique key into an associative array.
You then state "...select top 3 based on descending order of 3rd column..."
Once again, the sort put the file into the desired order, and the expression a[$1,$2]++ counts the lines in each group. Now just count up to three.
An awk program is organized as pairs of condition {action}. The condition a[$1,$2]++<3 is true for the first three lines seen with a given pair of fields, because the post-increment returns the count from before the current line was added.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
Great place to start is the online book Effective AWK Programming by Arnold D. Robbins
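The counting idea described above can be tried end to end on the sample data. This is a small sketch rather than a definitive version: the temporary file and the top_n variable are illustrative additions, and a pre-increment with <= is used in place of the post-increment so the limit reads directly as "keep n rows".

```shell
# Write the sample data from the question to a temporary file (illustrative).
data=$(mktemp)
cat > "$data" <<'EOF'
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
EOF

top_n=3   # hypothetical knob: how many rows to keep per group

# Sort by the two key columns, then by the third column numerically descending;
# awk then prints a line only while its group counter is still within the limit.
top=$(sort -k1,1 -k2,2 -k3,3nr "$data" |
    awk -v n="$top_n" '++seen[$1, $2] <= n')
printf '%s\n' "$top"
```

Changing top_n to another value keeps that many rows per group; groups with fewer rows are printed in full.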
@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.
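sort -k1,2 -k3,3rn data.txt \
| while read -r c1 c2 n ; do
    # Same group as the last printed line: bump the counter; otherwise reset it.
    if [[ $c1 == "$l1" && $c2 == "$l2" ]] ; then
        ((c++))
    else
        c=0
    fi
    # Print only the first three lines of each group.
    if (( c < 3 )) ; then
        echo "$c1 $c2 $n"
        l1=$c1
        l2=$c2
    fi
done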

Cat headers and renaming a column header using awk?

I've got an input file (input.txt) like this:
name value1 value2
A 3 1
B 7 4
C 2 9
E 5 2
And another file with a list of names (names.txt) like so:
B
C
Using grep -f, I can get all the lines with names "B" and "C"
grep -wFf names.txt input.txt
to get
B 7 4
C 2 9
However, I want to keep the header at the top of the output file and also rename the column "name" to "ID". Keeping only the rows with names B and C, the output should be:
ID value1 value2
B 7 4
C 2 9
I'm thinking awk should be able to accomplish this, but being new to awk I'm not sure how to approach this. Help appreciated!
While it is certainly possible to do this in awk, the fastest way to solve your actual problem is to simply prepend the header you want in front of the grep output.
echo "ID value1 value2" > Output.txt && grep -wFf names.txt input.txt >> Output.txt
Update: since the OP has multiple files, we can modify the above line to pull the first line out of the input file instead.
head -n 1 input.txt | sed 's/name/ID/' > Output.txt && grep -wFf names.txt input.txt >> Output.txt
Here is how to do it with awk
awk 'FNR==NR {a[$1]; next} FNR==1 {$1="ID"; print; next} $1 in a' names.txt input.txt
ID value1 value2
B 7 4
C 2 9
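Store the names from the first file as keys of array a.
Then, for each line of the second file, print it if field #1 is a key in array a. The header line is handled separately: its first field is replaced with ID before it is printed.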
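The two steps above can be sketched as a commented, self-contained run on the sample data. The scratch directory and the array name wanted are illustrative choices, and the script assumes names.txt is not empty (otherwise FNR==NR would also hold for the second file).

```shell
# Recreate the sample files from the question in a scratch directory (illustrative).
workdir=$(mktemp -d)
printf 'name value1 value2\nA 3 1\nB 7 4\nC 2 9\nE 5 2\n' > "$workdir/input.txt"
printf 'B\nC\n' > "$workdir/names.txt"

filtered=$(awk '
    FNR == NR { wanted[$1]; next }        # first file: remember each name as a key
    FNR == 1  { $1 = "ID"; print; next }  # second file, header: rename and print
    $1 in wanted                          # second file, data: print matching rows
' "$workdir/names.txt" "$workdir/input.txt")
printf '%s\n' "$filtered"
```

Assigning to $1 makes awk rebuild the header with single spaces between fields, which matches the sample but would change a tab-separated header unless OFS is set accordingly.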

Uniq and counts

I have a file with 2 columns. I need to use uniq on column 1 only and print both columns in the results, as well as the count of occurrences (with -c).
Example input:
1 a
1 a
2 a
3 c
4 d
Desired output:
2 1 a
1 2 a
1 3 c
1 4 d
echo '1 a
1 a
2 a
3 c
4 d' | uniq -c
outputs exactly your 2nd block.
It's not clear to me what you mean by "use uniq on column 1 only." What do you want to happen if column 1 appears multiple times with different column 2 values? If this can happen, your question probably needs a little clarification. If this can't happen in your scenario, then the easiest solution is probably
uniq -c filename
If the data is in a file and only column 1 with its count is needed:
awk '{print $1}' filename.txt | uniq -c
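If column 1 can repeat with different column 2 values, one possible reading of the question is "count by column 1 and show the first full line seen for each key". The sketch below implements that reading only; the choice to keep the first line is an assumption, and the array names count, first, and order are illustrative. Unlike uniq, it counts occurrences across the whole input, not just adjacent lines.

```shell
# Count lines per value of column 1 and print the count next to the first line seen.
counts=$(printf '1 a\n1 a\n2 a\n3 c\n4 d\n' | awk '
    !($1 in count) { order[++n] = $1; first[$1] = $0 }  # remember first line per key
    { count[$1]++ }                                       # tally every occurrence
    END { for (i = 1; i <= n; i++) print count[order[i]], first[order[i]] }
')
printf '%s\n' "$counts"
```

The order array keeps the keys in first-seen order, so the input does not need to be sorted beforehand.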

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
Unless you want to pull the first column then second then third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' | sort -s -n -k1,1 | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers
You could use a shell for-loop combined with cut if you know in advance the number of columns. Here is an example using bash syntax:
for i in {1..3}; do
    cut -d, -f "$i" file.txt
done
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort
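Sorting the values only groups them by column when the values happen to sort that way, as they do in the A,B,C sample. A sketch of an awk-only alternative that emits the data column by column regardless of the values is shown below. It holds the whole input in memory, the names cell and width are illustrative, and the input uses distinct letters so the ordering is visible.

```shell
# Print a comma-separated table column by column (first column top to bottom, then the next).
columns=$(printf 'A,B,C\nD,E,F\nG,H,I\n' | awk -F, '
    {
        for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store every field by row and column
        if (NF > width) width = NF                   # track the widest row
    }
    END {
        for (i = 1; i <= width; i++)                 # walk column by column
            for (r = 1; r <= NR; r++)
                if ((r, i) in cell) print cell[r, i]
    }
')
printf '%s\n' "$columns"
```

Rows with fewer fields than the widest row are simply skipped for the missing columns.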
