Extract text after last delimiter and attach at end of line [Linux/Ubuntu]

I have a fasta file that looks like below:
>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC
I'd like to extract only the number that comes after the last delimiter and append it to the end of each header, so that the output looks like this:
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
I've never used linux before and so far I've only been able to find this command to separate the text that comes after the last delimiter: sed -E 's/.*_//' filename.fasta . Can anyone give suggestions on what commands I should look for in addition to get my desired output?

You may try this sed command. It acts only on lines that start with >, matches the run of one or more digits at the end of the line, and replaces it with the match, an underscore, and the match again (& stands for the matched text):
sed -E '/^>/s/[0-9]+$/&_&/' file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
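To see what & does with a longer number, here is a small check on a made-up header with a two-digit suffix (the name sample_g12 is hypothetical, not from the question). The sequence line also ends in digits here, to show that only header lines are touched.

```shell
# Hypothetical two-line input: a header ending in 12 and a non-header line.
out=$(printf '>sample_g12\nACGT12\n' | sed -E '/^>/s/[0-9]+$/&_&/')
printf '%s\n' "$out"
```

The header should come out as >sample_g12_12, while the second line is printed unchanged because it does not start with >.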

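1st solution: With your shown samples, please try the following awk code. It appends an underscore and the last character of each header line, so it assumes a single-digit suffix. Written and tested in GNU awk; it should work in any awk.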
awk '/^>/{$0=$0 "_" substr($0,length($0))} 1' Input_file
2nd solution: Using GNU awk's match function with a regex and a capturing group. The [^0-9] before the group stops the greedy .* from swallowing all but the last digit of a multi-digit number.
awk 'match($0,/^>.*[^0-9]([0-9]+)$/,arr){$0=$0"_"arr[1]} 1' Input_file
3rd solution: If the lines starting with > always contain _g before the number, we can simply use _g as the field separator and append the second field.
awk -F'_g' '/^>/{$0=$0"_"$2} 1' Input_file
4th solution: If a perl one-liner is acceptable, you can use perl's capturing groups (which are set when the regex matches). The non-greedy .*? lets the second group take the whole trailing number.
perl -pe 's/^(>.*?)([0-9]+)$/$1${2}_$2/' Input_file
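The first solution copies only the final character, which is enough for the sample. A minimal sketch that also copes with multi-digit suffixes and uses only POSIX awk features (two-argument match() with RSTART) might look like this; the sample_g12 header is a made-up test case.

```shell
# POSIX awk: match() sets RSTART to where the trailing digits begin,
# so no GNU-only third argument is needed.
out=$(printf '>sequence_2_g3\nCACC\n>sample_g12\nACGT\n' |
  awk '/^>/ && match($0, /[0-9]+$/) { $0 = $0 "_" substr($0, RSTART) } 1')
printf '%s\n' "$out"
```

Because the regex is anchored at the end of the line, substr($0, RSTART) is exactly the trailing number, however many digits it has.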

Using sed
$ sed -E 's/.*_.([0-9]+)/&_\1/' input_file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC

Related

Remove attributes from XML using sed

First of all, there might be other (better) options, but I'm bound to sed or awk in this case.
I have an XML file with the following contents.
<Field name="field1" type="String">AAAA</Field>
<Field name="field2" type="Integer">0</Field>
<Field name="field4" type="String">BBBB</Field>
Here I would like to change the contents using sed, to get the following result:
<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>
So I want to remove the Field name=" part, the closing quote after the name, and the rest of the attributes up to the >, and I would also like to replace the closing </Field> with the actual field name.
How to approach with awk or sed?
Removing from the first tag works with
sed 's/ type=".*"//'
and
sed 's/Field name="//'
I'm not sure how to proceed with the replacing of the last one.
Using sed
$ sed -E 's~[A-Z][^"]*"([^"]*)[^>]*([^/]*/)[^>]*~\1\2\1~' input_file
<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>
Another sed:
sed -E 's/^[^"]+"([^"]+)("[^"]+){2}">([^<]+).*$/<\1>\3<\/\1>/' file.xml
1st solution: With your shown samples, please try the following sed code. The -E option enables ERE (extended regular expressions). The regex creates two capturing groups (the field name and the element text), and their values are reused in the substitution.
sed -E 's/^<Field name="([^"]*)"[^>]*>([^<]*)<.*$/<\1>\2<\/\1>/' Input_file
2nd solution: With awk, please try the following code. Written and tested with the shown samples. The field separator is set to <Field name=, ", > or < for all lines. The main block prints the 3rd and 7th fields wrapped in tags as per the required output.
awk -F'^<Field name=|"|>|<' '{print "<"$3">"$7"</"$3">"}' Input_file
3rd solution: With GNU awk, using its match function. The regex creates capturing groups whose values are stored in an array named arr and printed afterwards.
awk '
match($0,/<Field name="([^"]*)"[^"]*"[^>]*>([^<]*)</,arr){
print "<"arr[1]">" arr[2] "</"arr[1]">"
}
' Input_file
A simple alternative:
awk -F "[\"><]" '{print "<"$3">" $7 "</"$3">"}' input_file
Explanation:
Use ", < and > as delimiters to split each line into fields,
then print the fields you need (the closing tag needs the / before the field name).
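A quick way to find which field numbers to print is to dump every field with its index. A minimal sketch, using the same three delimiters on the first sample line:

```shell
# Print each field with its index so the right field numbers can be read off.
line='<Field name="field1" type="String">AAAA</Field>'
out=$(printf '%s\n' "$line" |
  awk -F '["<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$out"
```

Empty fields appear wherever two delimiters are adjacent (for example the "> pair), which is why the name ends up in field 3 and the text in field 7.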

How to use sed to replace a filename in text file

I have a file:
dynamicclaspath.cfg
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/struts/2_5_18/1_0_0/log4j-core-2.10.0.jar
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/log4j/2_17_1/log4j-core-2.10.0.jar
I want to replace any occurrence of log4j-core* with log4j-core-2.17.1.jar
I tried this but I know I'm missing a regex:
sed -i '/^log4j-core/ s/[-]* /log4j-core-2.17.1.jar/'
With your shown samples, please try the following sed program. The -E option enables ERE (extended regular expressions). The first capturing group matches everything up to the last /, and the rest of the pattern matches log4j-core- through .jar at the end of the line. The substitution writes back the first group followed by the new log4j file name.
sed -E 's/(^.*\/)log4j-core-.*\.jar$/\1log4j-core-2.17.1.jar/' Input_file
Using sed
$ sed -E 's/(log4j-core-)[0-9.]+/\12.17.1./' input_file
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/struts/2_5_18/1_0_0/log4j-core-2.17.1.jar
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/log4j/2_17_1/log4j-core-2.17.1.jar
It depends on possible other contents in your input file how specific the search pattern must be.
sed 's/log4j-core-.*\.jar/log4j-core-2.17.1.jar/' inputfile
or
sed 's/log4j-core-[0-9.]*\.jar/log4j-core-2.17.1.jar/' inputfile
or (if log4j-core*.jar is always the last part of the line)
sed 's/log4j-core.*/log4j-core-2.17.1.jar/' inputfile
sed -i 's#2\.10\.0\.jar$#2.17.1.jar#' file
That seems to work.
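Before editing the real file in place, it can help to try the command on a copy and keep a backup. A minimal sketch, assuming a throwaway directory; the OTHER_JAR line and its path are invented here to check that unrelated jars are left alone.

```shell
# Work on a throwaway copy; the second line is a hypothetical unrelated jar.
tmpdir=$(mktemp -d)
cfg="$tmpdir/dynamicclaspath.cfg"
printf '%s\n' \
  'VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/log4j/2_17_1/log4j-core-2.10.0.jar' \
  'OTHER_JAR=/clear-as-1-d/apps/other/log4j-api-2.10.0.jar' > "$cfg"

# -i.bak edits in place and keeps the original as dynamicclaspath.cfg.bak.
sed -i.bak 's/log4j-core-[0-9.]*\.jar/log4j-core-2.17.1.jar/' "$cfg"

# Count lines that still mention the old log4j-core version (expected: 0).
remaining=$(grep -c 'log4j-core-2\.10\.0' "$cfg" || true)
out=$(cat "$cfg")
printf '%s\n' "$out"
```

If the result looks wrong, the .bak file can simply be copied back over the edited one.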

Replacing characters in each line on a file in linux

I have a file with different word in each line.
My goal is to replace the first character to a capital letter and replace the 3rd character to "#".
For example: football will be exchanged to Foo#ball.
I thought about using awk and sed. That didn't help, since (to my knowledge) sed needs an exact character as input, and awk can print the desired character but not change it.
With GNU sed and two s commands:
echo 'football' | sed -E 's/(.)/\U\1/; s/(...)./\1#/'
Output:
Foo#ball
See: 3.3 The s Command, 5.7 Back-references and Subexpressions and 5.9.2 Upper/Lower case conversion
This might work for you (GNU sed):
sed 's/\(...\)./\u\1#/' file
With bash you can use parameter expansions alone to accomplish the task. For example, if you read each line into the variable line, you can do:
line="${line^}" # change football to Football (capitalize 1st char)
line="${line:0:3}#${line:4}" # make 4th character '#'
Example Input File
$ cat file
football
soccer
baseball
Example Use/Output
$ while read -r line; do line="${line^}"; echo "${line:0:3}#${line:4}"; done < file
Foo#ball
Soc#er
Bas#ball
While shell is typically slower, when use is limited to builtins, it doesn't fall too far behind.
(note: your question says 3rd character, but your example replaces the 4th character with '#')
With GNU awk for the 3rd arg to match():
$ echo 'football' | awk 'match($0,/(.)(..).(.*)/,a){$0=toupper(a[1]) a[2] "#" a[3]} 1'
Foo#ball
Cyrus' or Potong's answers are the preferred ones. (For Linux or systems with GNU sed because of \U or \u.)
This is just an additional solution with awk because you mentioned it and used also awk tag:
$ echo 'football'|awk '{a=substr($0,1,1);b=substr($0,2,2);c=substr($0,5);print toupper(a)b"#"c}'
Foo#ball
This is a very simple solution without regex. It will also work on non-GNU awk.
This should work with any version of awk:
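awk '{
for(i=1;i<=NF;i++){
# Note that string indexes start at 1 in awk !
# Keep characters 2-3, put "#" in place of the 4th, then append the rest
$i=toupper(substr($i,1,1)) substr($i,2,2) "#" substr($i,5)
}
print
}' file
Note: This follows the example (Foo#ball), which replaces the 4th character. If a word is shorter than 4 characters, the # is simply appended: it becomes It#.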
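Short words are worth a guard. A minimal sketch, assuming the goal is the Foo#ball form from the example, one word per line, and that words under four characters should only be capitalized (that last rule is an assumption; the question doesn't say):

```shell
# Capitalize every word; replace the 4th character only when there is one.
out=$(printf 'football\nit\ncat\n' | awk '{
  first = toupper(substr($0, 1, 1))
  if (length($0) >= 4) {
    print first substr($0, 2, 2) "#" substr($0, 5)
  } else {
    print first substr($0, 2)
  }
}')
printf '%s\n' "$out"
```

This uses only substr, length and toupper, so it should not depend on GNU extensions.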
If your data is in a file named d (tried with GNU sed):
sed -E 's/^(\w)(\w\w)\w/\U\1\E\2#/' d

Change some field separators in awk

I have a input file
1.txt
joshwin_xc8#yahoo.com:1802752:2222:
ihearttofurkey#yahoo.com:1802756:111113
www.rothmany#mail.com:xxmyaduh:13#;:3A
and I want an output file:
out.txt
joshwin_xc8#yahoo.com||o||1802752||o||2222:
ihearttofurkey#yahoo.com||o||1802756||o||111113
www.rothmany#mail.com||o||xxmyaduh||o||13#;:3A
I want to replace the first two ':' in 1.txt with '||o||', but with the script I am using
awk -F: '{print $1,$2,$3}' OFS="||o||" 3.txt
But it is not giving the expected output.
Any help would be highly appreciated.
Perl solution:
perl -pe 's/:/||o||/ for $_, $_' 1.txt
-p reads the input line by line and prints each line after processing it
s/// is similar to substitution you might know from sed
for in postposition runs the previous command for every element in the following list
$_ keeps the line being processed
For higher numbers, you can use for ($_) x N where N is the number. For example, to substitute the first 7 occurrences:
perl -pe 's/:/||o||/ for ($_) x 7' 1.txt
The following sed command may also help.
sed 's/:/||o||/;s/:/||o||/' Input_file
Explanation: The first s command replaces the 1st colon with ||o||. The original 2nd colon is then the first one remaining, so the second s command replaces it with ||o|| as well.
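Since the question asked about awk, the same idea (replace the first remaining colon, repeatedly) can be sketched there with sub() in a loop. The variable name n is my own choice for the number of colons to replace.

```shell
# n is the number of leading colons to replace (2 in the question).
out=$(printf '%s\n' \
  'joshwin_xc8#yahoo.com:1802752:2222:' \
  'www.rothmany#mail.com:xxmyaduh:13#;:3A' |
  awk -v n=2 '{ for (i = 0; i < n; i++) sub(/:/, "||o||") } 1')
printf '%s\n' "$out"
```

sub() changes only the first match on each call, so any colons after the first n are left as they are.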
Perl solution also, but I think the idea can apply to other languages: using the limit parameter of split:
perl -nE 'print join q(||o||), split q(:), $_, 3' file
(q quotes because I'm on Windows)
To replace the first 2 occurrences of :, use the loop below.
To change a different number of occurrences, adjust the range; for example, use {1..7} instead of {1..2} for the first 7 occurrences.
The output is saved in the original file (sed -i); nothing is printed to the terminal.
for i in {1..2}
do
sed -i "s/:/||o||/1" p.txt
done

How to use grep and sed in order to replace the substring after searching some specific string?

I want to know how to use two 'grep' and 'sed' utilities or something else in order to replace the substring. I will explain what I want to do below.
We have the file 'test.txt' with the following string:
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='AA5', A6='keyword_A'
After searching 'keyword_A' using grep, I want to replace the value of A5 with other string, for example, "NEW".
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
I tried to use two commands like
grep keyword_A test.txt | sed -e 's/blabla/blabla/'
After trying all I know, I gave up at all.
Please let me know the right solution.
First, you never need grep and sed. Sed has a full regular-expression search engine, so it is a superset of grep. This command will read test.txt, change the lines that you've indicated, and print the entire result on standard output:
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/g" < test.txt
If you want to store the results back into the file test.txt, use the -i (in-place editing) switch to sed:
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/g" -i.bak test.txt
If you want to select only the indicated lines, modify those, and print only those lines to standard out, use a combination of the p (print) command and the -n (no output) switch.
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/gp" -n test.txt
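To check that the /keyword_A/ address really limits the substitution, here is a small sketch with two shortened lines; the keyword_B line is made up for the comparison.

```shell
# Two sample lines: only the one containing keyword_A should change.
out=$(printf '%s\n' \
  "A5{ATTR}='AA5', A6='keyword_A'" \
  "A5{ATTR}='AA5', A6='keyword_B'" |
  sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/")
printf '%s\n' "$out"
```

The address plays the role of grep (it selects lines), while the s command after it does the replacement, all in one process.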
Using grep+sed is always the wrong approach. Here's one way to do it with GNU awk:
$ awk '/keyword_A/{ $0=gensub(/(A5({[^}]+})?=\047)[^\047]+/,"\\1NEW",1) } 1' file
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
Using a couple variables you could define the keyword and replacement ( if they change at all ):
q="keyword_A"
r="NEW"
Then with sed:
sed -r "s/^(.+\{.+\}=')(.+)('.+"${q}".+)$/\1"${r}"\3/" file
Result:
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
A5="NEW"
A6="keyword_A"
# with sed
sed "s/='[^']*\(',[[:blank:]]*A6='${A6}'\)/='${A5}\1/" YourFile
# with awk
awk -F "'" -v A5="${A5}" -v A6="${A6}" '
BEGIN { OFS="\047" }
$12 == A6 { $10 = A5; $0 = $0 }
1
' YourFile
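The sed version anchors the match on the end of the line (the A6 value) and replaces the quoted value just before it. The awk version uses ' as the field separator instead of the default whitespace.
The awk version assumes that no value contains a ' (otherwise the escaping would need to be handled).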
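We can directly replace the fifth comma-separated column when the string keyword_A is found, as shown below. \047 is the octal escape for a single quote, and the leading space keeps the original spacing after the comma:
awk -F, 'BEGIN{OFS=","} /keyword_A/{$5=" A5{ATTR}=\047NEW\047"} 1' filename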
Couple of slight alternatives:
sed -r "/keyword_A/s/(A5[^']*')[^']*/\1NEW/"
awk -F"'" '/keyword_A/{$10 = "NEW"}1' OFS="'"
The drawback with awk is that it writes to standard output, so to update the file you have to redirect the result to a new file and rename it afterwards.
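A minimal sketch of that redirect-and-rename step, assuming a throwaway copy of test.txt (the temporary directory and the .tmp suffix are my own choices):

```shell
# Throwaway file standing in for test.txt.
tmpdir=$(mktemp -d)
file="$tmpdir/test.txt"
printf '%s\n' "A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='AA5', A6='keyword_A'" > "$file"

# Write to a temporary file first; replace the original only if awk succeeded.
awk -F"'" -v OFS="'" '/keyword_A/ { $10 = "NEW" } 1' "$file" > "$file.tmp" &&
  mv "$file.tmp" "$file"

out=$(cat "$file")
printf '%s\n' "$out"
```

Redirecting straight back to the input file would truncate it before awk reads it, which is why the intermediate file is needed.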
