Paste corresponding characters from multiple lines together - linux

I'm writing a linux-command that pasts corresponding characters from multiple lines together. For example: I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, i've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns.
Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become
SHC UOA GND AED RYY (without spaces)

Awk way for updated spec
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i}
END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output
A----B-----D--C-
SHCUOAGNNAEDRYY
P.s for a large file this will use lots of memory
A terrible way not using awk, also you need to know the number of fields before hand.
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo

Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, which might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then using this value in the seq command instead of the above command substitution.

All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts: Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
:loop
g
s/$/\n/
s/\([^\n]\)[^\n]*\n/\1/g
p
g
s/^.//
s/\n./\n/g
h
/[^\n]/ b loop
}
This first follows the basic pattern
1h # if this is the first line, put it in the hold buffer
1!H # if it is not the first line, append it to the hold buffer
$ { # if this is the last line,
do stuff # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g # copy the hold buffer to the pattern space
s/$/\n/ # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g # replace every line with only its first character
p # print that
g # get the hold buffer again
s/^.// # remove the first character from the first line
s/\n./\n/g # remove the first character from all other lines
h # put that back in the hold buffer
/[^\n]/ b loop # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.

Related

How to cut file into chuck

How to get information from specimen1 to specimen3 and paste it into another file 'DNA_combined.txt'?
I tried cut command and awk commend but I found that it is tricky to cutting by paragraph(?) or sequence.
My trial was something like cut -d '>' -f 1-3 dna1.fasta > DNA_combined.txt
You can get the line number for each row using Esc + : and type set nu
Once you get the line number corresponding to each row:
Note down the line number corresponding to Line containing >Specimen 1 (say X) and Specimen 3 (say Y)
Then, use sed command to get the text between two lines
sed -n 'X,Yp' dna1.fasta > DNA_combined.txt
Please let me know if you have any questions.
If you want the first three sequences irrespective of the content after >, you can use this:
$ cat ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
>four
ATGC
>five
GTA
$ awk '/^>/ && ++count==4{exit} 1' ip.txt
>one
ACGTA
TCGAAA
>two
TGACA
>three
ACTG
AAAAC
/^>/ matches the start of a sequence
for such sequences, increment the count variable
if count reaches 4, the exit command will terminate the script
1 idiomatic way to print contents of input record
Would you please try the following:
awk '
BEGIN {print ">Specimen1-3"} # print header
/^>Specimen/ {f = match($0, "^>Specimen[1-3]") ? 1 : 0; next}
# set the flag depending on the number
f # print if f == 1
' dna1.fasta > DNA_combined.txt

how to loop through string for patterns from linux shell?

I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.
My current script does:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'
The first line is generating an output like this:
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
And the objective is to get that into a list of single :tag:'s.
Again, the problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
**I want :tagOne:tagTwo:etcTag: to be turned this into:
:tagOne:
:tagTwo:
:etcTag:
and so forth with :politics:violence: etc.
Colons aren't necessary, tagOne is just as good (maybe better, but this is trivial) than :tagOne:.
The problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
So I should replace the sed with something better...
I've tried:
A smarter sed:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u
...which works (for a limited number of tags) except that it produces weird results like:
:toxicity:p:
:somewhat:y:
:people:n:
...placing weird random letters at the end of some tags in which :p: is the final character of the :leadership: tag and "leadership" no longer appears in the list. Same for :y: and :n:.
I've also tried using loops in a couple ways...
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u | grep lead
...which has the same problem of :leadership: tags being lost etc.
And like...
for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
printf "$t\n";
done
done | sort -u
...which doesn't separate the tags at all, just prints stuff like:
:truama:leadership:business:toxicity
Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in python (I have a few python scripts but don't know the language well, but maybe this would be easy to do that way)? Every time I see awk I think "EEK!" so I'd prefer a non-awk solution please, preferring to stick to paradigms I've used in order to learn them better.
Using PCRE in grep (where available) and positive lookbehind:
$ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:
You will lose the leading : but get the tags nevertheless.
Edit: Did someone mention awk?:
$ awk '{
while(match($0,/:[^:]+:/)) {
a[substr($0,RSTART,RLENGTH)]
$0=substr($0,RSTART+1)
}
}
END {
for(i in a)
print i
}' file
Another idea using awk ...
Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One awk idea:
awk '
{ split($0,tmp,":") # split input on colon;
# NOTE: fields #1 and #NF are the empty string - see END block
for ( x in tmp ) # loop through tmp[] indices
{ arr[tmp[x]] } # store tmp[] values as arr[] indices; this eliminates duplicates
}
END { delete arr[""] # remove the empty string from arr[]
for ( i in arr ) # loop through arr[] indices
{ printf ":%s:\n", i } # print each tag on separate line leading/trailing colons
}
' tags.raw | sort # sort final output
NOTE: I'm not up to speed on awk's ability to internally sort arrays (thus eliminating the external sort call) so open to suggestions (or someone can copy this answer to a new one and update with said ability?)
The above also generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
A pipe through tr can split those strings out to separate lines:
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
An approach with sed:
sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd
This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr:
sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
Using awk to work with the : separated fields, removing duplicates:
awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd
Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One while/for/printf idea based on associative arrays:
unset arr
typeset -A arr # declare array named 'arr' as associative
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
arr[${word}]=1 # create/overwrite arr[$word] with value 1;
# objective is to make sure we have a single entry in arr[] for $word;
# this eliminates duplicates
done
done < tags.raw
printf ":%s:\n" "${!arr[#]}" | sort # pass array indices (ie, our unique list of words) to printf;
# per OPs desired output we'll bracket each word with a pair of ':';
# then sort
Per OPs comment/question about removing the array, a twist on the above where we eliminate the array in favor of printing from the internal loop and then piping everything to sort -u:
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
printf ":%s:\n" "${word}" # print ${word} to stdout
done
done < tags.raw | sort -u # pipe all output (ie, list of ${word}s for sorting and removing dups
Both of the above generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:

changing two lines of a text file

I have a bash script which gets a text file as input and takes two parameters (Line N° one and line N° two), then changes both lines with each other in the text. Here is the code:
#!/bin/bash
awk -v var="$1" -v var1="$2" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print s1 s
next
}1' Ham > newHam_changed.txt
It works fine for every two lines which are not consecutive. but for lines which follows after each other (for ex line 5 , 6) it works but creates a blank line between them. How can I fix that?
I think your actual script is not what you posted in the question. I think the line with all the prints contains:
print s1 "\n" s
The problem is that when the lines are consecutive, s1 will be empty (the for loop is skipped), but it will still print a newline before s, producing a blank line.
So you need to make that newline conditional.
awk -v var="4" -v var1="6" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print (s1 ? s1 "\n" : "") s
next
}1' Ham > newHam_changed.txt
Using getline makes awk scripts always a bit complicated. It is better to prevent the use of getline and just make use of the awk pattern { action } syntax. This will make perfectly readable scripts. In any other language you would just do a loop and get the next line, but in awk I think it is best to make good use of this feature.
awk -v var="$1" -v var1="$2" '
NR==var {s=$0; collect=1; next;}
NR==var1 {collect=0; print; printf inbetween; print s}
collect {inbetween=inbetween""$0"\n"; next;}
1' Ham
Here I capture the first line in s when I found it and set the collect flag. This will trigger the collect block on the next iteration which collects all lines in between. Whenever the second line is found it sets the collect back to zero and prints first the current line, than the inbetween lines and then s. If the lines are consecutive inbetween is empty and printf will than do nothing.
Too complex for my taste, here is something quite simple that achieves the same task:
#!/bin/bash
ORIGFILE='original.txt' # original text file
PROCFILE='processed.txt' # copy of the original file to be proccesed
CHGL1=`sed "$1q;d" $ORIGFILE` # get original $1 line
CHGL2=`sed "$2q;d" $ORIGFILE` # get original $2 line
`cat $ORIGFILE > $PROCFILE`
sed -i "$2s/^.*/$CHGL1/" $PROCFILE # replace
sed -i "$1s/^.*/$CHGL2/" $PROCFILE # replace
More code doesn't mean more useful, keep it simple. This code do not use for and instead goes directly to the specific lines.
EDIT:
A simple way on one line to do this task:
printf '%s\n' 14m26 26-m14- w q | ed -s file
Found in this answer.

sed split single line file and process resulting lines

I have an XML feed (this) in a single line so to extract the data I need I can do something like this:
sed -r 's:<([^>]+)>([^<]+)</\1>:&\n: g' feed | sed -nr '
/<item>/, $ s:.*<(title|link|description)>([^<]+)</\1>.*:\2: p'
since I can't find a way to make first sed call to process result as different lines.
Any advice?
My goal is to get all data I need in a single sed call
sed -rn -e 's|>[[:space:]]*<|>\n<|g
/^<title>/ { bx }
/^<description>/ { b x }
/^<link>/ { bx }
D
:x
s|<([^>]*)>([^\n]*)</\1>|\1=\2|;
P
D' rss.xml
New answer to new question. Now with branches and outputing all three chunks of information.
sed -rn -e 's|>[[:space:]]*<|>\n<|g # Insert newlines before each element
/^[^<]/ D # If not starting with <, delete until 1st \n and restart
/^<[^t]/ D # If not starting with <t, ""
/^<t[^i]/ D # If not starting with <ti, ""
/^<ti[^t]/ D
/^<tit[^l]/ D
/^<titl[^e]/ D
/^<title[^>]/ D # If not starting with <title>, delete until 1st \n and restart
s|^<title>|| # Delete <title>
s|</title>[^\n]*|| # Delete </title> and everything after it until the newline
P # Print everything up to the first newline
D' rss.xml # Delete everything up to the first newline and restart
By "restart" I mean go back to the top of the sed script and pretend we just read whatever is left.
I learned a lot about sed writing this. However, there is zero question that you really should be doing this in perl (or awk if you are old school).
In perl, this would be perl -pe 's%.*?<title>(.*?)</title>(?:.*?(?=<title>)|.*)%$1\n%g' rss.xml
Which is basically taking advantage of the minimal match (.*? is non-greedy, it will match the fewest number of character possible). The positive lookahead thing at the end is just so that I could do it in one s expression while still deleting everything at the end. There is more than one way…
If you needed multiple tags out of this xml file, it probably is still possible, but would probably involve branching and the like.
What about this:
sed -nr 's|>[[:space:]]*<|>\n<|g
h
/^<(title|link|description)>/ {
s:<([^>]+)>([^<]+)</\1>:\2: P
}
g
D
' feed

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2| tail -1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A")
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read result -a <<< "$result"
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file

Resources