I have a file that looks like:
ignoretext
START
a b
c d
e
END
ignoretext
START
f g h
i
END
ignoretext
I want to translate that into rows of:
a b c d e
f g h i
Here is one way to do it with awk
awk '/END/ {ORS=RS;print "";f=0} f; /START/ {ORS=" ";f=1}' file
a b c d e
f g h i
Here is a version that does not leave a space at the end of the line. There may be a shorter way to do this:
awk 'a && !/END/ {printf FS} /END/ {print "";f=a=0} f {printf "%s",$0;a++} /START/ {f=1}' file
a b c d e
f g h i
Here is another variant using GNU sed:
sed -n '/START/,/END/{:a;/START/d;/END/!{N;ba};s/\n/ /g;s/ END//;p}' file
a b c d e
f g h i
In a more readable format, with an explanation:
sed -n ' # Suppress default printing
/START/,/END/ { # For the range between /START/ and /END/
:a; # Create a label a
/START/d # If the line contains START, delete it
/END/! { # Until a line with END is seen
N # Append the next line to pattern space
ba # Branch back to label a to repeat
}
s/\n/ /g # Replace all newlines with spaces
s/ END// # Remove the END tag
p # Print the pattern space
}' file
Jotne's awk solution is probably the cleanest, but here's one way you can do it with GNU's version of sed:
sed -ne '/START/,/END/{/\(START\|END\)/!H}' \
-e '/END/{s/.*//;x;s/\n/ /g;s/^ *\| *$//;p}'
$ awk 'f{ if (/END/) {print rec; rec=sep=""; f=0} else {rec = rec sep $0; sep=" "} } /START/{f=1}' file
a b c d e
f g h i
I am trying to replace a string in an sh file using sed.
Issue: after 'connection' there is a blank line, and its '-url' part comes on the next line; in addition, the port number and password string need to be replaced as well. Using sed I am not able to remove the blank line after 'connection'.
Original String:
connection

-url>jdbc:oracle:thin:@10.10.10.11\:1551/password1 /connection-url
Replace with:
connection-url>jdbc:oracle:thin:@10.10.10.90\:1555/password2 /connection-url
I tried the commands below, which didn't work:
sed -i 's/connection[\t ]+/,/g' sed-script.sh
sed 's/\connection*-\connection*/-/g' sed-script.sh
Tested with GNU awk.
awk -v RS="^$" '{$1=$1} 1' Input_file
connection -url>jdbc:oracle:thin:@10.10.10.11\:1551/password1 /connection-url
Could you please try the following once:
awk '/^connection/{val=$0;next} NF && /^-url/{print val $0;val=""}' Input_file
Output will be as follows.
connection-url>jdbc:oracle:thin:@10.10.10.11\:1551/password1 /connection-url
You can remove the blank line after 'connection' using tr.
echo <input string> | tr -d "\n"
We can see the \n characters we need to remove by running the string through od -c:
0000000 c o n n e c t i o n \n \n - u r l
0000020 > j d b c : o r a c l e : t h i
0000040 n : @ 1 0 . 1 0 . 1 0 . 1 1 \ :
0000060 1 5 5 1 / p a s s w o r d 1 /
0000100 c o n n e c t i o n - u r l \n
With sed -
sed -E '
/connection$/,/^-url/ {
/connection$/ { h; d; }
/^$/ d
/^-url/ { H; s/.*//; x; s/\n//g; }
}
' old > new
Assumes no stray whitespace, and that a connection on a line by itself should be followed by a line that starts with -url...
sed processes a line at a time by default; if you want to check whether an empty line follows another line, you have to write a sed script to implement that.
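For completeness, a minimal GNU sed sketch of that idea (assuming the blank line and the -url line immediately follow the connection line) could look like:
sed '/^connection$/{N;N;s/\n\n-url/-url/}' file
Here the two N commands pull the blank line and the -url line into the pattern space, so the substitution can remove the embedded newlines in one go.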
I would go with Awk or Perl instead for this particular task.
perl -p0777 -i -e 's/connection\n\n-url/connection-url/' file
awk '/^connection/ { c=1; next }
c && /^$/ { c++; next }
c && /^-url/ { $1="connection" $1; c=0 }
c { print "connection";
while(--c) print "" }
1' file >file.new
Perl, like sed, has an -i option to replace the file in-place. GNU Awk has an extension to do the same (look for -i inplace) but it's not portable to lesser Awks.
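For example (a sketch assuming GNU Awk 4.1+ with its bundled inplace extension, and a hypothetical script file fix-connection.awk holding the program above):
gawk -i inplace -f fix-connection.awk file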
The Perl -0777 option causes the whole file to be slurped into memory as a single "line", line feeds (\n) and all. If the file is very big, this will obviously be problematic.
The Awk script takes care to put back the lines it skipped if it turned out to be a false match after all.
I want to make pairs of words based on the identifier column (the second column in the example below). My file is similar to this example:
A ID.1
B ID.2
C ID.1
D ID.1
E ID.2
F ID.3
The result I want is:
A C ID.1
A D ID.1
B E ID.2
C D ID.1
Note that I don't want to obtain the same word pair in the opposite order. In my real file some words appear more than one time with different identifiers.
I tried this code which works well but requires a lot of time (and I don't know if there are redundancies):
counter=2
cat filtered_go_annotation.txt | while read f1 f2; do
tail -n +$counter go_annotation.txt | grep $f2 | awk '{print "'$f1' " $1}';
((counter++))
done > go_network2.txt
The 'tail' is used to delete a line when it's read.
Awk solution:
awk '{ a[$2] = ($2 in a ? a[$2] FS : "") $1 }   # collect the words seen for each identifier
END {
  for (k in a) {
    len = split(a[k], items)                    # split the collected word list into an array
    for (i = 1; i <= len; i++)
      for (j = i+1; j <= len; j++)
        print items[i], items[j], k             # print each unordered pair once, with its identifier
  }
}' filtered_go_annotation.txt
The output:
A C ID.1
A D ID.1
C D ID.1
B E ID.2
With GNU awk for sorted_in and true multi-dimensional arrays:
$ cat tst.awk
{ vals[$2][$1] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in vals) {
for (j in vals[i]) {
for (k in vals[i]) {
if (j != k) {
print j, k, i
}
}
delete vals[i][j]
}
}
}
$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
I wonder if this would work (in GNU awk):
$ awk '
($2 in a) && !($1 in a[$2]) { # if ID.x is found in a and A not in a[ID.X]
for(i in a[$2]) # loop all existing a[ID.x]
print i,$1,$2 # and output combination of current and all previous matching
}
{
a[$2][$1] # hash to a
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
In two steps: sort on the identifier column, join the sorted file with itself on that column to generate every pair, and let awk drop self-pairs and reversed duplicates:
$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'
A C ID.1
A D ID.1
C D ID.1
B E ID.2
If your input is large, it may be faster to solve it in steps, e.g.:
# Create temporary directory for generated data
mkdir workspace; cd workspace
# Split original file
awk '{ print $1 > $2 }' ../infile
# Find all combinations
perl -MMath::Combinatorics \
-n0777aE \
'
$c=Math::Combinatorics->new(count=>2, data=>[@F]);
while(@C = $c->next_combination) {
say join(" ", @C) . " " . $ARGV
}
' *
Output:
C D ID.1
C A ID.1
D A ID.1
B E ID.2
Perl solution using regex backtracking:
perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt
For the flags, see perl -h.
^([^ ]*) (.*)\n : matches a line containing at least one space; the first capturing group captures what is left of the first space, the second capturing group what is to the right of it.
(?:.*\n)*?: matches (without capturing) 0 or more lines lazily to try following pattern first before matching more lines.
([^ ]*) (\2)\n : similar to first match using backreference \2 to match a line with the same key.
(?{say"$1 $3 $2"}) : code to display the captured groups
(?!) : makes the match fail, forcing the regex engine to backtrack and try further combinations.
Note that it could be shortened a bit
perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt
Yet another awk making use of the redefinition of $0. This makes the solution of RomanPerekhrest a bit shorter:
{a[$2]=a[$2] FS $1}
END { for(i in a) { $0=a[i]; for(j=1;j<NF;j++)for(k=j+1;k<=NF;++k) print $j,$k,i} }
I have the output of an analysis, and I would like to grep a keyword "X" (which always appears) every time a phrase "Y" occurs. The keyword "X" appears many times, but I only want the occurrence immediately after "Y".
For example, I would like to get the Folder name that follows every occurrence of Iter = 10, i.e. F1 and F4.
Iter = 10
Folder = F1
Iter = 5
Folder = F2
Iter = 6
Folder = F3
Iter = 10
Folder = F4
Any ideas?
Hexdump -c output of the file (as requested by @Inian):
0000000 I t e r = 1 0 \n F o l d
0000010 e r = F 1 \n \n I t e r
0000020 = 5 \n F o l d e r = F 2 \n
0000030 \n I t e r = 6 \n F o l d
0000040 e r = F 3 \n \n I t e r
0000050 = 1 0 \n F o l d e r = F 4
0000060 \n
0000061
You could use awk for this requirement. awk applies /pattern/{action} rules to each line of the input file. In our case we first match the string Iter = 10 and set a flag, so that on the next line matching a string starting with Folder we can extract the last space-delimited column, which in awk is represented by $NF, and then reset the flag for subsequent matches.
awk '/\<Iter = 10\>/{flag=1; next} flag && /^Folder/{print $NF; flag=0;}' file
Or, without the \< \> word-boundary operators, try:
awk '/Iter = 10/{flag=1; next} flag && /^Folder/{print $NF; flag=0;}' file
You could also use grep:
$ grep -A 1 Iter.*10 file | grep Folder | grep -o "[^ ]*$"
F1
F4
Explained:
grep -A 1 Iter.*10 file search for desired pattern and get some trailing context (-A 1, just one line)
grep Folder next search for keyword Folder
grep -o "[^ ]*$" get the last part of previous output
If there is noise between Iter and Folder lines you could remove that with grep "\(Iter.*10\|Folder\)" file first.
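Putting that pre-filter in front of the pipeline above gives something like this (untested sketch):
grep "\(Iter.*10\|Folder\)" file | grep -A 1 "Iter.*10" | grep Folder | grep -o "[^ ]*$"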
The above expects the Iter line to appear before the Folder line. If that is not the case, awk is the cure. For example, with data like this (line order varies and there is noise):
Folder = F1
Foo = bar
Iter = 10

Iter = 5
Foo = bar
Folder = F2
$ awk -v RS="" -F"\n" ' # record separated by empty line
/Iter/ && / 10$/ { # look for record with Iter 10
for(i=1;i<=NF;i++) # iterate all fields (lines within record)
if(split($i,a," *") && a[1]=="Folder") # split Folder line to components
print a[3] # output value
}
' file
F1
grep is simply a regex search.
For more complex operations, you can use awk.
E.g.
awk '/Iter = 10/ { getline; print $0 }' /path/to/file
where /path/to/file is the file containing your text to be searched
EDIT:
Just after posting my answer I read Inian's answer and it is more elaborate and accurate.
I have a text file mytext.txt, each line of the text is a sentence:
the quick brown fox jumps over the lazy dog
colorless green ideas sleep furiously
Then I have a dictionary file dict.txt like this:
the: A
quick: B
brown: C
fox: D
jumps: E
over: F
lazy: G
dog: H
colorless: I
green: J
ideas: K
sleep: L
furiously: M
I want to replace each word in mytext.txt with the value in dict.txt, like this:
A B C D E F A G H
I J K L M
How can I do it using awk or sed?
If your dict.txt does not have any special chars, a very fast solution is to convert the content of dict.txt into a sed expression:
sed 's#^#s/#;s#: #/#;s#$#/g;#' dict.txt
will result in
s/the/A/g;
s/quick/B/g;
s/brown/C/g;
s/fox/D/g;
s/jumps/E/g;
s/over/F/g;
s/lazy/G/g;
s/dog/H/g;
s/colorless/I/g;
s/green/J/g;
s/ideas/K/g;
s/sleep/L/g;
s/furiously/M/g;
Now this can be used as the script for another sed:
sed -f <(sed 's#^#s/#;s#: #/#;s#$#/g;#' dict.txt) mytext.txt
output:
A B C D E F A G H
I J K L M
But be aware that if the dict file contains any characters that are special to sed (/, \, ., *, and so on), it won't work.
Edit: added the g to sed
Update:
If only whole words should be replaced, this will do the trick, because \b looks for word boundaries:
sed -f <(sed 's#^#s/\\b#;s#: #\\b/#;s#$#/g;#' dict.txt) mytext.txt
Thanks @jm666 for pointing this out.
Edit2:
If the dict.txt file is very long, my original version might fail.
The version from @SLePort fixed this, thanks.
I previously used "$()" instead of -f <().
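That earlier form would have looked roughly like the following (a reconstruction, not the original command); it passes the whole generated script as a single argument, which can hit shell argument-length limits for big dictionaries:
sed "$(sed 's#^#s/#;s#: #/#;s#$#/g;#' dict.txt)" mytext.txt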
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF;next}{for(i in a)gsub(i,a[i])}1' dist mytext
OR
$ awk -F'[: ]' 'FNR==NR{ a[$1]=$NF; next }
{ for(i=1;i<=NF;i++) if($i in a)$i=a[$i] }1' dist mytext
Input
$ cat mytext
the quick brown fox jumps over the lazy dog
colorless green ideas sleep furiously
$ cat dist
the: A
quick: B
brown: C
fox: D
jumps: E
over: F
lazy: G
dog: H
colorless: I
green: J
ideas: K
sleep: L
furiously: M
Output
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF;next}{for(i in a)gsub(i,a[i])}1' dist mytext
A B C D E F A G H
I J K L M
$ awk -F'[: ]' 'FNR==NR{a[$1]=$NF; next}
{ for(i=1; i<=NF;i++) if($i in a)$i=a[$i] }1' dist mytext
A B C D E F A G H
I J K L M
Here is another alternative with awk and sed:
$ sed -f <(awk -F': ' '{print "s/\\b" $1 "\\b/" $2 "/g"}' dict) file
A B C D E F A G H
I J K L M
I have a file containing lines like the following:
a b c patch/sample/upgrade.sql
a b c demo/sample/script.sh
I want to be able to copy everything starting from the position after "c" to the last "/" and append it to the end of each line in the file. For example:
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Does anyone know how I can do this?
If every line starts with a b c, this should do it:
$> sed 's#a b c \(.*\)/[^/]*#& \1#g' foo.txt
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Otherwise, just adapt the first part.
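For instance, if the prefix is not literally a b c but the path is still the last space-separated field (and contains no spaces itself), a variant along these lines should work (untested sketch):
sed 's#.* \(.*\)/[^/]*$#& \1#' foo.txt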
If you can use awk then you can try:
awk '
{
last=sep=x
n=split($4,tmp,/[/]/)
for(i=1;i<n;i++) {
last=last sep tmp[i];
sep="/"
}
$0=$0 FS last
}1' file
$ cat file
a b c patch/sample/upgrade.sql
a b c demo/sample/script.sh
$ awk '
{
last=sep=x
n=split($4,tmp,/[/]/)
for(i=1;i<n;i++) {
last=last sep tmp[i];
sep="/"
}
$0=$0 FS last
}1' file
a b c patch/sample/upgrade.sql patch/sample
a b c demo/sample/script.sh demo/sample
Taking inspiration from @fredtantini's answer, try the following in a terminal:
sed -r 's|(\w+ \w+ \w+ (.*)/.*)|\1 \2|' foo.txt
If you want to output this to a file,
sed -r 's|(\w+ \w+ \w+ (.*)/.*)|\1 \2|' foo.txt > newFoo.txt