Cat headers and renaming a column header using awk? - linux

I've got an input file (input.txt) like this:
name value1 value2
A 3 1
B 7 4
C 2 9
E 5 2
And another file with a list of names (names.txt) like so:
B
C
Using grep -f, I can get all the lines with names "B" and "C"
grep -wFf names.txt input.txt
to get
B 7 4
C 2 9
However, I want to keep the header at the top of the output file and also rename the column "name" to "ID". With the rows for B and C kept, the output should be:
ID value1 value2
B 7 4
C 2 9
I'm thinking awk should be able to accomplish this, but being new to awk I'm not sure how to approach this. Help appreciated!

While it is certainly possible to do this in awk, the fastest way to solve your actual problem is to simply prepend the header you want in front of the grep output.
echo "ID value1 value2" > Output.txt && grep -wFf names.txt input.txt >> Output.txt
Update: since the OP has multiple files, we can modify the above line to pull the first line out of the input file instead.
head -n 1 input.txt | sed 's/name/ID/' > Output.txt && grep -wFf names.txt input.txt >> Output.txt
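For several input files, the same two steps can be wrapped in a function and a loop. This is only a sketch: the function name filter_with_header, the input*.txt glob and the .filtered suffix are made up for illustration, and each input is assumed to start with a header line whose first column is "name".

```shell
# Hypothetical helper: print the header with "name" renamed to "ID",
# followed by the rows that match an entry in the names file.
filter_with_header() {
    names=$1
    input=$2
    # Header line: rename only a leading "name" (anchored with ^).
    head -n 1 "$input" | sed 's/^name/ID/'
    # Data lines: whole-word, fixed-string match against the list.
    tail -n +2 "$input" | grep -wFf "$names"
}

# Assumed file naming; adjust the glob to your data.
for f in input*.txt; do
    [ -e "$f" ] || continue   # nothing matched the glob
    filter_with_header names.txt "$f" > "${f%.txt}.filtered"
done
```

As in the one-liner above, grep -w matches a name anywhere on the line, not just in the first column, so this fits data where the names cannot also appear as values.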

Here is how to do it with awk:
awk 'FNR==NR {a[$1]; next} FNR==1 {$1="ID"; print; next} $1 in a' names.txt input.txt
ID value1 value2
B 7 4
C 2 9
Store the names from the first file as indices of an array a.
Then test whether field #1 of each data line is an index in array a.

Related

How to do divide a column based on the corresponding value in another file?

I have multiple files (66) and want to divide column 3 of each file by its corresponding value in info.file and insert the result as column 4 of each file.
My manual code is:
awk '{print $4=$3/NUmber from info.file}1' file
But this takes me hours to do for each individual file. So I want to automate it for all files. Thanks
file1:
chrm name value
4 a 8
3 b 4
file2:
chrm name value
3 g 6
5 s 12
info.file:
file_name average
file1 8
file2 6
file3 10
output:
file1:
chrm name value new_value
4 a 8 1
3 b 4 0.5
file2:
chrm name value new_value
3 g 6 1
5 s 12 2
Without error handling:
$ awk 'NR==FNR {a[$1]=$2; next}
FNR==1 {out=FILENAME".new"; print $0, "new_value" > out; next}
{v=$NF/a[FILENAME]; $++NF=v; print > out}' info.file file1 file2
This will generate the updated files:
$ head file{1,2}.new | column -t
==> file1.new <==
chrm name value new_value
4 a 8 1
3 b 4 0.5
==> file2.new <==
chrm name value new_value
3 g 6 1
5 s 12 2
Explanation
NR==FNR {a[$1]=$2; next} scan the first file and save the file name/value pairs in the associative array a
FNR==1 matches the header line of each data file
out=FILENAME".new" set the output filename
print $0, "new_value" > out print the existing header with the new column name appended
v=$NF/a[FILENAME] for every data line, divide the last field by the value stored for this file and assign the result to v
$++NF=v increment the number of fields and assign the computed value to the new last field
print > out print the new line to the output file set on the header line
info.file file1 file2 the info file must come first, before the list of data files
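The command above divides by whatever it finds for the file, so a data file missing from info.file (or listed with 0) causes a division-by-zero error. A sketch of one way to add that error handling, assuming the data files are passed under exactly the names listed in info.file; the function name scale_by_info is invented here:

```shell
# scale_by_info INFO FILE...: write FILE.new with an extra new_value column,
# skipping any FILE that has no usable divisor in INFO.
scale_by_info() {
    awk '
        # First file: load file_name -> average, ignoring its header line.
        NR == FNR { if (FNR > 1) avg[$1] = $2; next }

        # Header line of each data file: decide once whether to process it.
        FNR == 1 {
            if (out != "") close(out)   # avoid running out of open files
            ok = (FILENAME in avg) && (avg[FILENAME] + 0 != 0)
            if (!ok) {
                print "skipping " FILENAME ": no non-zero average in info file" > "/dev/stderr"
                next
            }
            out = FILENAME ".new"
            print $0, "new_value" > out
            next
        }

        # Data lines of accepted files: append last field / average.
        ok { print $0, $NF / avg[FILENAME] > out }
    ' "$@"
}
```

Usage would be along the lines of scale_by_info info.file file1 file2, run from the directory that holds the data files.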
I have prepared the following double nested awk command for you:
awk 'NR>1{system("awk -v div="$2" -f div_column3.awk "$1" | column -t > new_"$1);}' info.file
with div_column3.awk being an awk script file with the following content:
$ cat div_column3.awk
NR==1{print $0" new_value"}NR>1{print $0" "$3/div}

If first two columns are equal, select top 3 based on descending order of 3rd column

I want to select the top 3 results for each group of lines that share the same first two columns.
For example the data will look like,
cat data.txt
A A 10
A A 1
A A 2
A A 5
A A 8
A B 1
A B 2
A C 6
A C 5
A C 10
A C 1
B A 1
B A 1
B A 2
B A 8
And for the result I want
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Note that some of the "groups" do not contain 3 rows.
I have tried
sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt
It seems to work, but since I am trying to learn to code better, I was wondering if there is a better way to go about this. Also, my code generates many files that I then have to delete.
You can do:
$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A A 10
A A 8
A A 5
A B 2
A B 1
A C 10
A C 6
A C 5
B A 8
B A 2
B A 1
Explanation:
There are two key items to understand in the awk program: associative arrays and fields.
Referencing an awk array element that does not exist yet creates it with an empty value, which counts as 0 in a numeric context, so you can use it as a counter.
You state If first two columns are equal...
The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.
You then state ...select top 3 based on descending order of 3rd column...
Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.
awk is organized into blocks of condition {action}. The condition a[$1,$2]++<3 is true for the first three lines seen with a given pair of first two fields and false for every later line with that pair.
A wordier version of the program would be:
awk 'a[$1,$2]++<3 {print $0}'
But the default action if the condition is true is to print $0 so it is not needed.
If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.
A great place to start is the online book Effective AWK Programming by Arnold Robbins.
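The counting idiom explained above is not tied to the number 3. A small sketch that passes the limit in with -v; the function name top_n is invented for this example:

```shell
# top_n N FILE: print at most N rows per (column 1, column 2) group,
# largest third column first.
top_n() {
    sort -k1,1 -k2,2 -k3,3nr "$2" |
        awk -v n="$1" 'seen[$1, $2]++ < n'
}
```

For example, top_n 3 data.txt gives the same result as the one-liner, and top_n 1 data.txt keeps only the row with the maximum third column in each group.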
@dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:
sort -k1,2 -k3,3nr file |
awk '
{key = $1 FS $2}
prev != key {prev = key; count = 1}
count <= 3 {print; count++}
'
You can sort the file by the first two columns and then by the third column numerically in descending order, then read the output and print only the first three lines for each combination of the first two columns.
sort -k1,2 -k3,3rn data.txt \
| while read -r c1 c2 n ; do
    if [[ $c1 == "$l1" && $c2 == "$l2" ]] ; then
        # same group as the previous line
        ((c++))
    else
        # new group: reset the counter and remember the key
        c=0
        l1=$c1
        l2=$c2
    fi
    if (( c < 3 )) ; then
        echo "$c1 $c2 $n"
    fi
done

How to add the number of identical lines next to the line itself?

I have a file, file.txt, which looks like this:
a
b
b
c
c
c
I want a command that takes file.txt as input and produces this output:
a 1
b 2
c 3
I think uniq is the command you are looking for. The output of uniq -c is a little different from your format, but this can be fixed easily.
$ uniq -c file.txt
1 a
2 b
3 c
If you want to count the occurrences you can use uniq with -c.
If the file is not sorted you have to use sort first
$ sort file.txt | uniq -c
1 a
2 b
3 c
If you really need the line first followed by the count, swap the columns with awk
$ sort file.txt | uniq -c | awk '{ print $2 " " $1}'
a 1
b 2
c 3
You can use this awk:
awk '!seen[$0]++{ print $0, (++c) }' file
a 1
b 2
c 3
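seen is an array keyed by line content; !seen[$0]++ is true only the first time a given line is encountered. The action prints the record followed by an incrementing counter c, so the second column numbers the distinct lines rather than counting repeats.
Update: based on the comment below, if the intent is to get a repeat count in the 2nd column, use this awk command (the order of a for-in loop is unspecified, so pipe the output through sort if the order matters):
awk '{ seen[$0]++ } END{ for (i in seen) print i, seen[i] }' file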
a 1
b 2
c 3
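If the counts should come out in the order the lines first appear, without sorting the input, a second array can record that order. A minimal sketch; the function name count_lines and the array names are made up here:

```shell
# count_lines FILE: print each distinct line once, followed by how many
# times it occurs, in order of first appearance.
count_lines() {
    awk '
        !($0 in count) { order[++n] = $0 }   # remember first-seen position
        { count[$0]++ }                      # tally every occurrence
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    ' "$1"
}
```

Unlike sort | uniq -c, this also handles duplicates that are not adjacent while keeping the original ordering.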

how to sort a file according to another file?

Is there a unix oneliner or some other quick way on linux to sort a file according to a permutation set by the sorting of another file?
i.e.:
file1: (separated by CRLFs, not spaces)
2
3
7
4
file2:
a
b
c
d
sorted file1:
2
3
4
7
so the result of this one liner should be
sorted file2:
a
b
d
c
paste file1 file2 | sort | cut -f2
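The pipeline above glues each key to its line, sorts, and cuts the key off again. A plain sort compares text, so 10 would come before 9. The sketch below assumes the keys in file1 are numbers and that the sort implementation supports -s (GNU and BSD sort do); the function name sort_by_key is invented:

```shell
# sort_by_key KEYFILE DATAFILE: print the lines of DATAFILE reordered the
# way a numeric sort would order the lines of KEYFILE.
sort_by_key() {
    tab=$(printf '\t')
    paste "$1" "$2" |              # decorate: key<TAB>line
        sort -t "$tab" -s -k1,1n | # sort on the key only; -s keeps ties in input order
        cut -f2-                   # undecorate: drop the key, keep the rest
}
```

If the files really do have CRLF line endings, strip the carriage returns first (for example with tr -d '\r') so they do not end up inside the key.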
Below is a perl one-liner that will print the contents of file2 based on the sorted input of file1.
perl -n -e 'BEGIN{our($x,$t,@a)=(0,1,)}if($t){$a[$.-1]=$_}else{$a[$.-1].=$_ unless($.>$x)};if(eof){$t=0;$x=$.;close ARGV};END{foreach(sort @a){($j,$l)=split(/\n/,$_,2);print qq($l)}}' file1 file2
Note: If the files are different lengths, the output will only print up to the shortest file length.
For example, if file-A has 5 lines and file-B has 8 lines then the output will only be 5 lines.

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
Unless you want to pull the first column then second then third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' file.txt | sort -s -k1,1n | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers
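The same column-by-column output can also be produced in a single awk process by buffering each column and printing the buffers at the end. This is a sketch that holds the whole file in memory, so it suits small to medium inputs; the function name columns_in_order is made up:

```shell
# columns_in_order FILE: print all values of column 1, then all values of
# column 2, and so on, keeping the original row order within each column.
columns_in_order() {
    awk -F, '
        {
            for (i = 1; i <= NF; i++) {
                col[i] = col[i] $i "\n"   # append this value to column i
            }
            if (NF > max) max = NF        # track the widest row
        }
        END { for (i = 1; i <= max; i++) printf "%s", col[i] }
    ' "$1"
}
```

Because nothing is split on whitespace afterwards, values that contain spaces survive intact.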
You could use a shell for-loop combined with cut if you know the number of columns in advance. Here is an example using bash syntax:
for i in {1..3}; do
cut -d, -f $i file.txt
done
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort

Resources