I am trying to modify text files using awk. There are three columns and I want to delete part of the text in the first column:
range=chr1 20802865 20802871
range=chr1 23866528 23866534
to
chr1 20802865 20802871
chr1 23866528 23866534
How can I do this?
I've tried awk '{ substr("range=chr*", 7) }' and awk '{sub(/[^[:space:]]*\\/, "")}1' but it deletes all the contents of the file.
Set the field separator as = and print the second field:
# With awk
$ awk -F= '{print $2}' file
chr1 20802865 20802871
chr1 23866528 23866534
# Or with cut
$ cut -d= -f2 file
chr1 20802865 20802871
chr1 23866528 23866534
# How about grep
$ grep -Po '(?<==).*' file
chr1 20802865 20802871
chr1 23866528 23866534
# Temp file needed
$ cut -d= -f2 file > tmp; mv tmp file
awk, cut and grep all need a temporary file if you want to store the changes back into file; a better solution is to use sed:
sed -i 's/range=//' file
This substitutes range= with nothing, and -i means the changes are made in place, so there is no need to deal with temporary files; sed handles that for you.
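One caveat worth knowing: -i is not specified by POSIX, and BSD/macOS sed parses it slightly differently from GNU sed. Passing a backup suffix with no space after -i works with both. A minimal sketch (demo.txt is a hypothetical sample file created here for illustration):

```shell
# Create a sample file matching the question's input
printf 'range=chr1 20802865 20802871\nrange=chr1 23866528 23866534\n' > demo.txt

# -i.bak edits the file in place and keeps demo.txt.bak as a backup;
# the attached-suffix form is accepted by both GNU and BSD sed.
sed -i.bak 's/range=//' demo.txt

cat demo.txt
```

If you don't want the backup file, delete demo.txt.bak afterwards.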
If you don't need to use awk, you can use sed, which I find a bit simpler. Hopefully you are familiar with regex operators, like ^ and ..
$ cat awkens
range=chr1 20802865 20802871
range=chr1 23866528 23866534
$ sed 's/^range=//' awkens
chr1 20802865 20802871
chr1 23866528 23866534
It looks like you are using tabs instead of spaces as delimiters in your file, so:
awk 'BEGIN{FS="[=\t]"; OFS="\t"} {print $2, $3, $4}' input_file
or
awk 'BEGIN{FS="[=\t]"; OFS="\t"} {$1=""; gsub("\t\t", "\t"); print}' input_file
Related
I have the following tab delimited data:
chr1 3119713 3119728 MA05911Bach1Mafk 839 +
chr1 3119716 3119731 MA05011MAFNFE2 860 +
chr1 3120036 3120051 MA01502Nfe2l2 866 +
What I want to do is remove the first 7 characters from the 4th column.
Resulting in
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
How can I do that?
Note the output needs to be also TAB separated.
I'm stuck with the following code, which removes characters from the first column onward, which I don't want:
sed 's/^.\{7\}//' myfile.txt
awk 'BEGIN{FS=OFS="\t"} { $4 = substr($4, 8); print }' myfile.txt
perl -anE'$F[3] =~ s/.{7}//; say join "\t", @F' data.txt
or
perl -anE'substr $F[3],0,7,""; say join "\t", @F' data.txt
With sed
$ sed -E 's/^(([^\t]+\t){3}).{7}/\1/' myfile.txt
chr1 3119713 3119728 Bach1Mafk 839 +
chr1 3119716 3119731 MAFNFE2 860 +
chr1 3120036 3120051 Nfe2l2 866 +
-E use extended regular expressions, to avoid having to use \ for (){}. Some sed versions might need -r instead of -E
^(([^\t]+\t){3}) capture the first three columns, easy to change number of columns if needed
.{7} characters to delete from 4th column
\1 the captured columns
Use -i option for in-place editing
With perl you can use \K for variable length positive lookbehind
perl -pe 's/^([^\t]+\t){3}\K.{7}//' myfile.txt
file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find matches of file2's first column in file1 and print each whole file1 line plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My solution is very slow in bash:
#!/bin/bash
while read line; do
c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
I would prefer a FASTER bash or awk solution. The output can be modified, but it needs to keep all the information (the column order can be different).
EDIT:
Right now the fastest solution looks to be this one, according to @chepner:
#!/bin/bash
while read -r c d; do
    grep "$c" file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output
In a single Awk command,
awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
A more readable, multi-line version:
FNR==NR {
# map the values from 'file2' into the hash-map 'map'
map[$1]=$2
next
}
# On 'file1' do
{
# Iterate through the array map
for (i in map){
# If there is a direct regex match on the line with the
# element from the hash-map, print it and append the
# hash-mapped value at last
if($0 ~ i){
$(NF+1)=map[i]
print
next
}
}
}
Another solution using join and sed, under the assumption that file1 and file2 are sorted:
join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
If the output order doesn't matter, using awk:
awk 'FNR==NR{d[$1]=$2; next}
{split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1
you get,
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
Try this:
cat file2
NR_024540 11
NR_024541 12
cat file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14361 14829 NR_024542_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
Performance (tested with 55,000 records):
time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1
real 0m0.16s
user 0m0.14s
sys 0m0.01s
You are starting a lot of external programs unnecessarily. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.
while read -r c d; do
awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output
If the searched string is always the same length (length("NR_024540")==9):
awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1
Explained:
NR==FNR { # process file2
a[$1]=$2 # hash record using $1 as the key
next # skip to next record
}
(i=substr($4,1,9)) && (i in a) { # read the first 9 bytes of $4 to i and search in a
print $0, a[i] # output if found
}
awk -F '[[:blank:]_]+' '
FNR==NR { a[$2]=$3 ;next }
{ if ( $5 in a ) $0 = $0 " " a[$5] }
7
' file2 file1
Comment:
use _ as an extra field separator so the names are easier to compare in both files (using only the number part).
7 is for fun; it's just a non-zero value -> print the line
I don't change any field (NF+1, ...), so we keep the original format, just appending the referenced number
A smaller one-liner (optimized for code size), assuming the lines in file1 are non-empty. If the separators are only spaces, you can replace [:blank:] with a space character:
awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1
No awk or sed needed. This assumes file2 is only one line:
n="`cut -f 2 file2`" ; while read x ; do echo "$x $n" ; done < file1
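The while read loop above forks an echo per input line, which gets slow on big files; reading the value once and letting awk do the appending in a single pass is usually much faster. A sketch under the same one-line-file2 assumption (sample files created here, with space-separated file2 for illustration):

```shell
# Sample data matching the question's layout
printf 'chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468\n' > file1
printf 'NR_024540 11\n' > file2

# Read the single value from file2 once, then append it to every file1 line.
n=$(cut -f 2 -d ' ' file2)
awk -v n="$n" '{ print $0, n }' file1
```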
I have a text file /tmp/some.txt with the values below:
JOHN YES 6 6 2345762
SHAUN NO 6 6 2345748
I want to create a csv file in the format below (i.e. based on rows, NOT based on columns).
JOHN,YES,6,6,2345762
SHAUN,NO,6,6,2345748
I tried the code below:
for i in `wc -l /tmp/some.txt | awk '{print $1}'`
do
awk 'NR==$i' /tmp/some.txt | awk '{print $1","$2","$3","$4","$5}' >> /tmp/some.csv
done
Here wc -l /tmp/some.txt | awk '{print $1}' gets the value 2 (i.e. 2 rows in the text file).
And for each row, awk 'NR==$i' /tmp/some.txt | awk '{print $1","$2","$3","$4","$5}' should print the 5 fields, separated by commas, into the some.csv file.
When I execute each command separately it works, but when I make it a shell script I get an empty some.csv file.
@Kart: Could you please try the following:
awk '{$1=$1;} 1' OFS=, Input_file > output.csv
I hope this helps you.
I suggest:
sed 's/[[:space:]]\+/,/g' /tmp/some.txt
You almost got it. awk already processes the file row by row, so you don't need to iterate with the for loop.
So you just need to run:
awk '{print $1","$2","$3","$4","$5}' /tmp/some.txt >> /tmp/some.csv
With tr, squeezing (-s), and then transliterating space/tab ([:blank:]):
tr -s '[:blank:]' ',' <file.txt
With sed, substituting one or more space/tab with ,:
sed 's/[[:blank:]]\+/,/g' file.txt
With awk, replacing one or more space/tab with , using the gsub() function:
awk 'gsub("[[:blank:]]+", ",", $0)' file.txt
Example
% cat foo.txt
JOHN YES 6 6 2345762
SHAUN NO 6 6 2345748
% tr -s '[:blank:]' ',' <foo.txt
JOHN,YES,6,6,2345762
SHAUN,NO,6,6,2345748
% sed 's/[[:blank:]]\+/,/g' foo.txt
JOHN,YES,6,6,2345762
SHAUN,NO,6,6,2345748
% awk 'gsub("[[:blank:]]+", ",", $0)' foo.txt
JOHN,YES,6,6,2345762
SHAUN,NO,6,6,2345748
I have over 100 files with at least 5-8 columns (tab-separated) in each file. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append them all together.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept file contents same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
j=$(printf '%03d' $i)
awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can someone help me figure it out?
You don't need two different awk scripts. And you don't use $ to refer to variables in awk, that's used to refer to input fields (i.e. $k means access the field whose number is in the variable k).
for i in {1..3}; do
j=$(printf '%03d' $i)
awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
As pointed out in the comments, your problem is that you're using typographic quotes (”) as if they were ASCII double quotes. Once you fix that, though, you don't need a loop or any of that other complexity; all you need is:
$ awk 'BEGIN{FS=OFS="\t"} {$NF="f"ARGIND} 1' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above used GNU awk for ARGIND.
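If your awk is not GNU awk, ARGIND is unavailable, but you can count files yourself by incrementing a counter whenever FNR resets to 1. A portable sketch (assuming every input file is non-empty; sample files created here):

```shell
# Two sample tab-separated files, as in the question
printf 'chr1\t1\t2\t15\nchr2\t3\t4\t17\n' > file001.txt
printf 'chr1\t1\t2\t15\nchr2\t3\t4\t17\n' > file002.txt

# FNR==1 fires on the first record of each input file, so c ends up
# holding the current file's index; this mirrors GNU awk's ARGIND.
awk 'BEGIN{FS=OFS="\t"} FNR==1{c++} {$NF="f"c} 1' file001.txt file002.txt
```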
I have several files starting with the string "file" then a number (file1, file2, etc).
The content of these files is similar and looks like this
file1:
$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"
$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"
For each file, I want to keep only the index of the element and the 2 numeric fields just after coordinates (in the same file or in another one):
file1:
0 636.46 1800.37
1 367.78 1263.63
What I tried is the following (but it is not correct):
find . -name "file*"|while read fname; do
echo "$fname"
for line in $(cat "$fname") do
FS="[_() ]"
print $7 "\t" $10 "\t" $11 > $fname
done
done
This is a perfect use case for awk.
With awk you can simply print specific fields.
If the index is the line number, you can use this:
cat -n ./file1 | awk '{print $1-1 " " $7 " " $8}'
This numbers the lines and prints the line number minus one (so it starts at 0), followed by the seventh and eighth fields.
If the index is the $elt_(0) number in parentheses, you can combine awk and sed like this:
cat ./file1 | awk '{print $4 " " $6 " " $7}' | sed 's/"$elt_(//g' | sed 's/)//g' | sed 's/"//g'
output:
0 636.46 1800.37
1 367.78 1263.63
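The whole cat | awk | sed chain can usually be collapsed into a single awk call by treating underscores, parentheses, and spaces all as field separators. A sketch (field positions assume exactly the line layout shown in the question; file1 is recreated here for illustration):

```shell
# Sample input matching the question's format
printf '$xx_ at 10.0 "$elt_(0) coordinates 636.46 1800.37 9.90"\n$xx_ at 10.0 "$elt_(1) coordinates 367.78 1263.63 7.90"\n' > file1

# With runs of '_', '(', ')' and ' ' as separators, the element index
# becomes field 5 and the two coordinates become fields 7 and 8.
awk -F '[_() ]+' '{ print $5, $7, $8 }' file1
```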