substitute consecutive tabs for "\tNA\t" - linux

Have a badly formatted tsv file with empty fields all over the place. I wish to fill these empty spaces with "NA" on linux.
I tried awk '{gsub("\t\t","\tNA\t"); print$0)' but that only substitutes one empty space to NA instance. Chaining the command awk '{gsub("\t\t","\tNA\t"); print$0)|awk '{gsub("\t\t","\tNA\t"); print$0) does two substitutions per line - but not particularly helpful if I have many columns to deal with.
Is there a faster (neater) way to do this?

Its a bit complex since you have to handle newlines empty fields, end of line empty fields and potentially successive empty fields. I could not achieve something with sed, it's probably insane. But with awk this seems to work:
$ cat test.txt
a c d e
g h i j
k l m n
p s t
w x
$ awk -F$'\t' '{for(i=1;i<=NF;++i){if($i==""){printf "NA"}else{printf $i} if(i<NF)printf "\t"} printf "\n"}' test.txt
a NA c d e
NA g h i j
k l m n NA
p NA NA s t
NA NA w x NA
Beware copy paste, the tabs will probably be transformed to spaces... By the way I searched a solution for the CSV files, and adapted it from this thread ;) where you can see that the most readable option is the awk one.

Did you try with sed? For example:
cat test.txt
test test test
test test test
sed 's:\t\t*:\tNA\t:g' test.txt
test NA test NA test
test NA test NA test

Ok this works:
awk '{ gsub(/\t\t\t/,"\tNA\tNA\t"); print $0}' test.txt | awk '{ gsub(/\t\t/,"\tNA\t"); print $0}' | awk '{ gsub(/\t\t/,"\tNA\t"); print
$0}' | awk '{gsub(/^[\t]+/,"NA\t"); print $0}'
interestingly this doesn't:
awk '{ gsub(/\t\t\t/,"\tNA\tNA\t"); print $0}' test.txt | awk '{ gsub(/\t\t/,"\tNA\t"); print $0}' | awk '{gsub(/^[\t]+/,"NA\t"); print
$0}'
I'm sure there is a more elegant solution though..

Related

Print lines not containg a period linux

I have a file with thousands of rows. I want to print the rows which do not contain a period.
awk '{print$2}' file.txt | head
I have used this to print the column I am interested in, column 2 (The file only has two columns).
I have removed the head and then did
awk '{print$2}' file.txt | grep -v "." | head
But I only get blank lines not any actual values which is expected, I think it has included the spaces between the rows but I am not sure.
Is there an alternative command?
As suggested by Jim, I did-
awk '{print$2}' file.txt | grep -v "\." | head
However the number of lines is greater than before, is this expected? Also, my output is a list of numbers but with spaces in between them (Vertical), is this normal?
file.txt example below-
120.4 3
270.3 7.9
400.8 3.9
200.2 4
100.2 8.7
300.2 3.4
102.3 6
49.0 2.3
38.0 1.2
So the expected (and correct) output would be 3 lines, as there is 3 values in column 2 without the period:
$ awk '{print$2}' file.txt | grep -v "\." | head
3
4
6
However, when running the code as above, I instead get 5, which is also counting the spaces between the rows I think:
$ awk '{print$2}' file.txt | grep -v "\." | head
3
4
6
You seldom need to use grep if you're already using awk
This would print the second column on each line where that second column doesn't contain a dot:
awk '$2 !~ /\./ {print $2}'
But you also wanted to skip empty lines, or perhaps ones where the second column is not empty. So just test for that, too:
awk '$2 != "" && $2 !~ /\./ {print $2}'
(A more amusing version would be awk '$2 ~ /./ && $2 !~ /\./ {print $2}' )
As you said, grep -v "." gives you only blank lines. That's because the dot means "any character", and with -v, the only lines printed are those that don't contain, well, any characters.
grep is interpreting the dot as a regex metacharacter (the dot will match any single character). Try escaping it with a backslash:
awk '{print$2}' file.txt | grep -v "\." | head
If I understand well, you can try this sed
sed ':A;N;${s/.*/&\n/};/\n$/!bA;s/\n/ /g;s/\([^ ]*\.[^ ]* \)//g' file.txt
output
3
4
6

how to use awk/sed to deal with these two files to get a result that I want

I want to use awk/sed to deal with two files(a.txt and b.txt) below and get the result
cat a.txt
a UK
b Japan
c China
d Korea
e US
And cat b.txt results
c Russia
e Canada
The result that I want is as below:
a UK
b Japan
c Russia
d Korea
e Canada
With awk:
First fill aray/hash a with complete row ($0) and use first column ($1) from this row as index. Finally, print all elements of array/hash a with a loop.
awk '{a[$1]=$0} END{for(i in a) print a[i]}' file1 file2
Output:
a UK
b Japan
c Russia
d Korea
e Canada
try:
awk 'FNR==NR{A[$1]=$NF;next} {printf("%s %s\n",$1,$1 in A?A[$1]:$NF)}' b.txt a.txt
Checking here condition FNR==NR which will be TRUE only when first file(b.txt) is being read. Then creating an array named A whose index is $1 and have the value last column. Then using printf for printing 2 strings where first string is $1 and another is if $1 of a.txt is present in array A then print array A's value whose index is $1 else print last column of a.tzt itself.
EDIT: as OP had carriage characters into Input_files so please remove them by following too.
tr -d '\r' < b.txt > temp_b.txt && mv temp_b.txt b.txt
You can use the below one-liner:
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt ) | sed 's/ --$//' | awk -F ' -- ' '{print $NF}'
We use awk to prefix each line in b.txt with a key and -- to give us a split point later:
<( awk '{print $1, "--", $0, "--"}' < b.txt )
Use the join command to join the files on common keys. The -a 1 option tells the command to
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt )
Use sed to remove the -- parts that are on some end of lines:
sed 's/ --$//'
Use awk to print the last item on each line:
awk -F ' -- ' '{print $NF}'
$ awk 'NR==FNR{b[$1]=$2;next} {print $1, ($1 in b ? b[$1] : $2)}' b.txt a.txt
a UK
b Japan
c Russia
d Korea
e Canada

Making horizontal String vertical shell or awk

I have a string
ABCDEFGHIJ
I would like it to print.
A
B
C
D
E
F
G
H
I
J
ie horizontal, no editing between characters to vertical. Bonus points for how to put a number next to each one with a single line. It'd be nice if this were an awk or shell script, but I am open to learning new things. :) Thanks!
If you just want to convert a string to one-char-per-line, you just need to tell awk that each input character is a separate field and that each output field should be separated by a newline and then recompile each record by assigning a field to itself:
awk -v FS= -v OFS='\n' '{$1=$1}1'
e.g.:
$ echo "ABCDEFGHIJ" | awk -v FS= -v OFS='\n' '{$1=$1}1'
A
B
C
D
E
F
G
H
I
J
and if you want field numbers next to each character, see #Kent's solution or pipe to cat -n.
The sed solution you posted is non-portable and will fail with some seds on some OSs, and it will add an undesirable blank line to the end of your sed output which will then become a trailing line number after your pipe to cat -n so it's not a good alternative. You should accept #Kent's answer.
awk one-liner:
awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
test :
kent$ echo "ABCDEF"|awk 'BEGIN{FS=""}{for(i=1;i<=NF;i++)print i,$i}'
1 A
2 B
3 C
4 D
5 E
6 F
So I figured this one out on my own with sed.
sed 's/./&\n/g' horiz.txt > vert.txt
One more awk
echo "ABCDEFGHIJ" | awk '{gsub(/./,"&\n")}1'
A
B
C
D
E
F
G
H
I
J
This might work for you (GNU sed):
sed 's/\B/\n/g' <<<ABCDEFGHIJ
for line numbers:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | sed = | sed 'N;y/\n/ /'
or:
sed 's/\B/\n/g' <<<ABCDEFGHIJ | cat -n

linux command to get the last appearance of a string in a text file

I want to find the last appearance of a string in a text file with linux commands. For example
1 a 1
2 a 2
3 a 3
1 b 1
2 b 2
3 b 3
1 c 1
2 c 2
3 c 3
In such a text file, i want to find the line number of the last appearance of b which is 6.
I can find the first appearance with
awk '/ b / {print NR;exit}' textFile.txt
but I have no idea how to do it for the last occurrence.
cat -n textfile.txt | grep " b " | tail -1 | cut -f 1
cat -n prints the file to STDOUT prepending line numbers.
grep greps out all lines containing "b" (you can use egrep for more advanced patterns or fgrep for faster grep of fixed strings)
tail -1 prints last line of those lines containing "b"
cut -f 1 prints first column, which is line # from cat -n
Or you can use Perl if you wish (It's very similar to what you'd do in awk, but frankly, I personally don't ever use awk if I have Perl handy - Perl supports 100% of what awk can do, by design, as 1-liners - YMMV):
perl -ne '{$n=$. if / b /} END {print "$n\n"}' textfile.txt
This can work:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
We check every second file being "b" and we record the number of line. It is appended, so by the time we finish reading the file, it will be the last one.
Test:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
6
Update based on sudo_O advise:
$ awk '{if ($2=="b") a=NR} END{print a}' your_file
to avoid having some abc in 2nd field.
It is also valid this one (shorter, I keep the one above because it is the one I thought :D):
$ awk '$2=="b" {a=NR} END{print a}' your_file
Another approach if $2 is always grouped (may be more efficient then waiting until the end):
awk 'NR==1||$2=="b",$2=="b"{next} {print NR-1; exit}' file
or
awk '$2=="b"{f=1} f==1 && $2!="b" {print NR-1; exit}' file

In a *nix environment, how would I group columns together?

I have the following text file:
A,B,C
A,B,C
A,B,C
Is there a way, using standard *nix tools (cut, grep, awk, sed, etc), to process such a text file and get the following output:
A
A
A
B
B
B
C
C
C
You can do:
tr , \\n
and that will generate
A
B
C
A
B
C
A
B
C
which you could sort.
Unless you want to pull the first column then second then third, in which case you want something like:
awk -F, '{for(i=1;i<=NF;++i) print i, $i}' | sort -sk1 | awk '{print $2}'
To explain this, the first part generates
1 A
2 B
3 C
1 A
2 B
3 C
1 A
2 B
3 C
the second part will stably sort (so the internal order is preserved)
1 A
1 A
1 A
2 B
2 B
2 B
3 C
3 C
3 C
and the third part will strip the numbers
You could use a shell for-loop combined with cut if you know in advanced the number of columns. Here is an example using bash syntax:
for i in {1..3}; do
cut -d, -f $i file.txt
done
Try:
awk 'BEGIN {FS=","} /([A-C],)+([A-C])?/ {for (i=1;i<=NF;i++) print $i}' YOURFILE | sort

Resources