Making pairs of words based on one column - linux

I want to make pairs of words based on the third column (identifier). My file is similar to this example:
A ID.1
B ID.2
C ID.1
D ID.1
E ID.2
F ID.3
The result I want is:
A C ID.1
A D ID.1
B E ID.2
C D ID.1
Note that I don't want to obtain the same word pair in the opposite order. In my real file some words appear more than once, with different identifiers.
I tried this code, which works but takes a lot of time (and I don't know if there are redundancies):
counter=2
cat filtered_go_annotation.txt | while read f1 f2; do
    tail -n +$counter go_annotation.txt | grep $f2 | awk '{print "'$f1' " $1}'
    ((counter++))
done > go_network2.txt
The 'tail' is used to skip the lines that have already been read, so a pair is not produced twice. Note that grep $f2 also matches substrings, so ID.1 would match an identifier like ID.10.

Awk solution:
awk '{ a[$2] = ($2 in a ? a[$2] FS : "") $1 }
END {
    for (k in a) {
        len = split(a[k], items)
        for (i = 1; i <= len; i++)
            for (j = i+1; j <= len; j++)
                print items[i], items[j], k
    }
}' filtered_go_annotation.txt
The output:
A C ID.1
A D ID.1
C D ID.1
B E ID.2

With GNU awk for sorted_in and true multi-dimensional arrays:
$ cat tst.awk
{ vals[$2][$1] }
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (i in vals) {
        for (j in vals[i]) {
            for (k in vals[i]) {
                if (j != k) {
                    print j, k, i
                }
            }
            delete vals[i][j]
        }
    }
}
$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2

I wonder if this would work (in GNU awk):
$ awk '
($2 in a) && !($1 in a[$2]) {   # if ID.x is already in a and this word is not yet in a[ID.x]
    for (i in a[$2])            # loop over all words stored under a[ID.x]
        print i, $1, $2         # output each previous match combined with the current word
}
{
    a[$2][$1]                   # hash into a (GNU awk array of arrays)
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2

In two steps:
$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'
A C ID.1
A D ID.1
C D ID.1
B E ID.2

If your input is large, it may be faster to solve it in steps, e.g.:
# Create temporary directory for generated data
mkdir workspace; cd workspace
# Split original file
awk '{ print $1 > $2 }' ../infile
# Find all combinations
perl -MMath::Combinatorics \
     -n0777aE \
     '
      $c = Math::Combinatorics->new(count => 2, data => [@F]);
      while (@C = $c->next_combination) {
          say join(" ", @C) . " " . $ARGV
      }
     ' *
Output:
C D ID.1
C A ID.1
D A ID.1
B E ID.2

Perl solution using regex backtracking:
perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt
For the flags, see perl -h.
^([^ ]*) (.*)\n : matches a line containing at least one space; the first capturing group is the part left of the first space, the second capturing group the part to its right.
(?:.*\n)*? : matches (without capturing) 0 or more lines, lazily, so the following pattern is tried before more lines are consumed.
([^ ]*) (\2)\n : like the first match, but the backreference \2 requires a line with the same key.
(?{say"$1 $3 $2"}) : code block that prints the captured groups.
(?!) : forces the match to fail, so the engine backtracks and finds every combination.
Note that it could be shortened a bit
perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt

Yet another awk, making use of the redefinition of $0. This makes the solution of RomanPerekhrest a bit shorter:
{ a[$2] = a[$2] FS $1 }
END {
    for (i in a) {
        $0 = a[i]
        for (j = 1; j < NF; j++)
            for (k = j+1; k <= NF; ++k)
                print $j, $k, i
    }
}

Related

Run several string substitutions safely

I have to run many substitutions on a text file and I need to distinguish a string that has been written in place of something else from the same string if it was originally there.
For instance, say I want to replace a with b, and b with c in the second field of the following file (to get b c c)
a a
a b
b c
if I run awk '$2 == "a" {$2 = "b"}; $2 == "b" {$2 = "c"} 1' file obviously I get
a c
a c
b c
I could pay attention to the order in which I run the substitutions here, but not really in the real case. I'd like to have a flexible script where I can write the substitutions in any order and not have to worry about values being overwritten. I've tried with an optimistic awk '$2 == "a" {$2 = b}; $2 == "b" {$2 = c}; b = "b"; c = "c"; 1' file but it didn't work.
Since you only want to perform the substitution at most once, you're better off with if ... else if ...
awk '{
if ($2 == "a") {$2 = "b"}
else if ($2 == "b") {$2 = "c"}
else if ($2 == "c") {$2 = "a"}
print
}' <<END
a a
a b
b c
END
a b
a c
b a
Format the code to suit your style.
Another approach that may be more elegant:
awk '
BEGIN {repl["a"] = "b"; repl["b"] = "c"; repl["c"] = "a"}
$2 in repl {$2 = repl[$2]}
1
' <<END
a a
a b
b c
END
The general, idiomatic approach to not changing a string that you just changed is to map the old values to strings that cannot appear in the input and then convert those to the new values:
$ cat tst.awk
BEGIN {
    old2new["a"] = "b"
    old2new["b"] = "c"
}
{
    # Step 1 - put an "X" after every "#" so "#<anything else>"
    # cannot exist in the input from this point on.
    gsub(/#/, "#X", $2)

    # Step 2 - map "old"s to intermediate strings that cannot exist
    c = 0
    for (old in old2new) {
        gsub(old, "#" c++, $2)
    }

    # Step 3 - map the intermediate strings to the new strings
    c = 0
    for (old in old2new) {
        gsub("#" c++, old2new[old], $2)
    }

    # Step 4 - restore the "#X"s to "#"s
    gsub(/#X/, "#", $2)

    # Step 5 - print the record
    print
}
$ awk -f tst.awk file
a b
a c
b c
I used gsub()s as that's the most common application of this but feel free to use ifs if that's more appropriate for your case.
Obviously the approach of just concatenating c++ onto the end of # only works for up to 10 substitutions; beyond that you'd have to come up with a mapping to other strings (which is trivial, just don't trip over RE metacharacters).

replace every word in a file with the value from another dictionary file

I have a text file mytext.txt, each line of the text is a sentence:
the quick brown fox jumps over the lazy dog
colorless green ideas sleep furiously
Then I have a dictionary file dict.txt like this:
the: A
quick: B
brown: C
fox: D
jumps: E
over: F
lazy: G
dog: H
colorless: I
green: J
ideas: K
sleep: L
furiously: M
I want to replace each word in mytext.txt with the value in dict.txt, like this:
A B C D E F A G H
I J K L M
How can I do it using awk or sed?
If your dict.txt does not have any special chars, a very fast solution is to convert the content of dict.txt into a sed expression:
sed 's#^#s/#;s#: #/#;s#$#/g;#' dict.txt
will result in
s/the/A/g;
s/quick/B/g;
s/brown/C/g;
s/fox/D/g;
s/jumps/E/g;
s/over/F/g;
s/lazy/G/g;
s/dog/H/g;
s/colorless/I/g;
s/green/J/g;
s/ideas/K/g;
s/sleep/L/g;
s/furiously/M/g;
now this can be used for another sed:
sed -f <(sed 's#^#s/#;s#: #/#;s#$#/g;#' dict.txt) mytext.txt
output:
A B C D E F A G H
I J K L M
But be aware that if the dict file contains any characters special to sed (/ \ . * and so on) it won't work.
Edit: added the g flag to sed.
Update:
If only whole words should be replaced, this will do the trick, because \b matches word boundaries:
sed -f <(sed 's#^#s/\\b#;s#: #\\b/#;s#$#/g;#' dict.txt) mytext.txt
Thanks @jm666 for pointing this out.
Edit2:
If the dict.txt file is very long, my original version might fail.
The version of @SLePort fixed this, thanks.
I previously used "$()" instead of -f <()
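If the dictionary may contain sed metacharacters, one option (not from the original answer) is to escape them while generating the sed script, for example with awk; the dict.txt below is a hypothetical example containing a "." that would otherwise match any character:

```shell
# Hypothetical dictionary containing a sed metacharacter (".")
cat > dict.txt <<'EOF'
the: A
a.out: B
EOF

cat > mytext.txt <<'EOF'
the a.out binary
EOF

# Build the sed script with awk, backslash-escaping characters that
# are special in a sed pattern; "\", "/" and "&" are escaped in the
# replacement text.
sed -f <(awk -F': ' '{
    pat = $1; gsub(/[][\\\/.^$*]/, "\\\\&", pat)
    rep = $2; gsub(/[\\\/&]/, "\\\\&", rep)
    print "s/" pat "/" rep "/g"
}' dict.txt) mytext.txt
```

Without the escaping, s/a.out/B/g would also rewrite words like "about".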
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF;next}{for(i in a)gsub(i,a[i])}1' dist mytext
OR
$ awk -F'[: ]' 'FNR==NR{ a[$1]=$NF; next }
{ for(i=1;i<=NF;i++) if($i in a)$i=a[$i] }1' dist mytext
Input
$ cat mytext
the quick brown fox jumps over the lazy dog
colorless green ideas sleep furiously
$ cat dist
the: A
quick: B
brown: C
fox: D
jumps: E
over: F
lazy: G
dog: H
colorless: I
green: J
ideas: K
sleep: L
furiously: M
Output
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF;next}{for(i in a)gsub(i,a[i])}1' dist mytext
A B C D E F A G H
I J K L M
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF; next}
{ for(i=1; i<=NF;i++) if($i in a)$i=a[$i] }1' dist mytext
A B C D E F A G H
I J K L M
Here is another alternative with awk and sed:
$ sed -f <(awk -F': ' '{print "s/\\b" $1 "\\b/" $2 "/g"}' dict) file
A B C D E F A G H
I J K L M

How to add column indices in bash

I have a text file with a number of rows and columns like this:
a b c d ...
e f g h ...
i j k l ...
...
I want to add column indices for each entry with the output look like this
1:a 2:b 3:c 4:d ...
1:e 2:f 3:g 4:h ...
1:i 2:j 3:k 4:l ...
...
I am wondering if there is a simple way to realize this in bash. Thanks!
With awk:
awk '{for (i=1;i<=NF;i++){printf i":"$i" "};printf "\n"}' file
Output:
1:a 2:b 3:c 4:d 5:...
1:e 2:f 3:g 4:h 5:...
1:i 2:j 3:k 4:l 5:...
1:...
With perl:
perl -lane '$, = " "; print map { (1 + $_) . ":$F[$_]" } 0 .. $#F' file
# or
perl -lane '$, = " "; $i = 1; print map { $i++ . ":$_" } @F' file
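Since the question asks about bash specifically, a plain-bash loop works too; a sketch (slower than awk on large files):

```shell
# Sample rows
printf '%s\n' 'a b c d' 'e f g h' > file

# Pure bash: split each line into an array and prefix every field
# with its 1-based index.
while read -ra fields; do
    out=()
    for i in "${!fields[@]}"; do
        out+=("$((i + 1)):${fields[$i]}")
    done
    echo "${out[*]}"
done < file
```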

Linux translate blocks of text to lines of text

I have a file that looks like:
ignoretext
START
a b
c d
e
END
ignoretext
START
f g h
i
END
ignoretext
I want to translate that into rows of:
a b c d e
f g h i
Here is one way to do it with awk
awk '/END/ {ORS=RS;print "";f=0} f; /START/ {ORS=" ";f=1}' file
a b c d e
f g h i
Added a version that does not leave a space at the end of the line. There may be a shorter way to do this:
awk 'a && !/END/ {printf FS} /END/ {print "";f=a=0} f {printf "%s",$0;a++} /START/ {f=1}'
a b c d e
f g h i
Here is another variant using GNU sed:
sed -n '/START/,/END/{:a;/START/d;/END/!{N;ba};s/\n/ /g;s/ END//;p}' file
a b c d e
f g h i
In a more readable format, with an explanation:
sed -n '              # Suppress default printing
/START/,/END/ {       # For the range between /START/ and /END/
    :a;               # Create a label a
    /START/d          # If the line contains START, delete it
    /END/! {          # Until a line with END is seen
        N             # Append the next line to the pattern space
        ba            # Branch back to label a to repeat
    }
    s/\n/ /g          # Replace all newlines with spaces
    s/ END//          # Remove the END tag
    p                 # Print the pattern space
}' file
Jotne's awk solution is probably the cleanest, but here's one way you can do it with GNU's version of sed:
sed -ne '/START/,/END/{/\(START\|END\)/!H}' \
-e '/END/{s/.*//;x;s/\n/ /g;s/^ *\| *$//g;p}'
$ awk 'f{ if (/END/) {print rec; rec=sep=""; f=0} else {rec = rec sep $0; sep=" "} } /START/{f=1}' file
a b c d e
f g h i
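Another option, not from the answers above, is to let awk split the input into records on END, keep only the part after START, and flatten each record. A sketch assuming an awk that treats a multi-character RS as a regex (GNU awk or mawk):

```shell
cat > blocks.txt <<'EOF'
ignoretext
START
a b
c d
e
END
ignoretext
START
f g h
i
END
ignoretext
EOF

# Each record runs up to an "END" line; drop everything through the
# "START" line, then join the remaining lines with spaces.
awk -v RS='END\n' '
sub(/.*START\n/, "") {
    gsub(/\n/, " ")
    sub(/ +$/, "")
    print
}' blocks.txt
```

Records with no START line (the ignoretext parts) never match the sub() and are silently skipped.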

How to copy and append a string to the end of a line

I have a file containing lines like the following:
a b c patch/sample/upgrade.sql
a b c demo/sample/script.sh
I want to be able to copy everything starting from the position after "c" to the last "/" and append it to the end of each line in the file. For example:
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Does anyone know how I can do this?
If every line starts with a b c, this should do it:
$> sed 's#a b c \(.*\)/[^/]*#& \1#g' foo.txt
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Otherwise, just adapt the first part.
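If the prefix varies, an awk alternative (not in the original answers) is to copy the last field and strip its final path component:

```shell
cat > foo.txt <<'EOF'
a b c patch/sample/upgrade.sql
a b c demo/sample/script.sh
EOF

# Copy the last field, remove everything from the final "/" onward,
# and append the directory part to the line.
awk '{ dir = $NF; sub(/\/[^\/]*$/, "", dir); print $0, dir }' foo.txt
```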
If you can use awk then you can try:
awk '
{
    last = sep = ""
    n = split($4, tmp, "/")
    for (i = 1; i < n; i++) {
        last = last sep tmp[i]
        sep = "/"
    }
    $0 = $0 FS last
}1' file
$ cat file
a b c patch/sample/upgrade.sql
a b c demo/sample/script.sh
$ awk '
{
    last = sep = ""
    n = split($4, tmp, "/")
    for (i = 1; i < n; i++) {
        last = last sep tmp[i]
        sep = "/"
    }
    $0 = $0 FS last
}1' file
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Taking inspiration from @fredtantini's answer, try the following in a terminal:
sed -r 's|(\w+ \w+ \w+ (.*)/[^/]*)|\1 \2|' foo.txt
If you want to output this to a file:
sed -r 's|(\w+ \w+ \w+ (.*)/[^/]*)|\1 \2|' foo.txt > newFoo.txt
