Remove attributes from XML using sed - linux

First of all, there might be other (better) options, but I'm bound to sed or awk in this case.
I have an XML file with the following contents.
<Field name="field1" type="String">AAAA</Field>
<Field name="field2" type="Integer">0</Field>
<Field name="field4" type="String">BBBB</Field>
Here I would like to change the contents using sed, to get the following result:
<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>
So: remove the <Field name=" prefix, the closing quote after the name, and the rest of the attributes up to the >, and also replace the closing </Field> with the actual field name.
How can I approach this with awk or sed?
Removing from the first tag works with
sed 's/ type=".*"//'
and
sed 's/Field name="//'
I'm not sure how to proceed with replacing the closing tag.

Using sed
$ sed -E 's~[A-Z][^"]*"([^"]*)[^>]*([^/]*/)[^>]*~\1\2\1~' input_file
<field1>AAAA</field1>
<field2>0</field2>
<field4>BBBB</field4>

Another sed:
sed -E 's/^[^"]+"([^"]+)("[^"]+){2}">([^<]+).*$/<\1>\3<\/\1>/' file.xml

1st solution: With your shown samples, please try the following sed code. The -E option enables ERE (extended regular expressions). The regex creates capturing groups, and the values captured in those groups are reused in the substitution.
sed -E 's/^<Field name="([^"]*)"[^>]*>([^<]*)<.*$/<\1>\2<\/\1>/' Input_file
2nd solution: With awk, please try the following code. Written and tested with the shown samples. The field separator is set to <Field name=, ", > or < for all lines. The main block prints the 3rd and 7th fields along with the tags, as per the required output.
awk -F'^<Field name=|"|>|<' '{print "<"$3">"$7"</"$3">"}' Input_file
3rd solution: With GNU awk, using its match function: the regex creates capturing groups whose values are stored in an array named arr and printed afterwards.
awk '
match($0,/<Field name="([^"]*)"[^"]*"[^>]*>([^<]*)</,arr){
print "<"arr[1]">" arr[2] "</"arr[1]">"
}
' Input_file

As simple as it is elegant:
awk -F "[\"><]" '{print "<"$3">" $7 "</"$3">"}' input_file
Explanation:
Use ", < and > as delimiters to split each line into fields,
then print the ones you need (field 3 is the name, field 7 is the value).
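To see why fields 3 and 7 are the ones to print, it can help to dump every field of one sample line. This is only a sketch, assuming the input looks exactly like the sample in the question; the variable names line, fields, name and value are made up for illustration.

```shell
# One sample line from the question.
line='<Field name="field1" type="String">AAAA</Field>'

# Split on ", > and < and print every field with its index.
fields=$(printf '%s\n' "$line" |
  awk -F '["><]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# Pick out the two fields the one-liner relies on.
name=$(printf '%s\n' "$line" | awk -F '["><]' '{ print $3 }')
value=$(printf '%s\n' "$line" | awk -F '["><]' '{ print $7 }')
printf 'name=%s value=%s\n' "$name" "$value"
```

Field 1 is empty because the line starts with a delimiter, which is why the name ends up in field 3 rather than field 2.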

Related

How to use sed to replace a filename in text file

I have a file:
dynamicclaspath.cfg
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/struts/2_5_18/1_0_0/log4j-core-2.10.0.jar
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/log4j/2_17_1/log4j-core-2.10.0.jar
I want to replace any occurrence of log4j-core* with log4j-core-2.17.1.jar
I tried this but I know I'm missing a regex:
sed -i '/^log4j-core/ s/[-]* /log4j-core-2.17.1.jar/'
With your shown samples, please try the following sed program. The -E option enables ERE (extended regular expressions). The first capturing group matches everything up to the last /; after it, the pattern matches log4j-core- through .jar at the end of the line. The replacement is the first capturing group followed by the new file name, as per OP's requirement.
sed -E 's/(^.*\/)log4j-core-.*\.jar$/\1log4j-core-2.17.1.jar/' Input_file
Using sed
$ sed -E 's/(log4j-core-)[0-9.]+/\12.17.1./' input_file
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/struts/2_5_18/1_0_0/log4j-core-2.17.1.jar
VENDOR_JAR=/clear-as-1-d/apps/sterling/jar/log4j/2_17_1/log4j-core-2.17.1.jar
It depends on possible other contents in your input file how specific the search pattern must be.
sed 's/log4j-core-.*\.jar/log4j-core-2.17.1.jar/' inputfile
or
sed 's/log4j-core-[0-9.]*\.jar/log4j-core-2.17.1.jar/' inputfile
or (if log4j-core*.jar is always the last part of the line)
sed 's/log4j-core.*/log4j-core-2.17.1.jar/' inputfile
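How specific the pattern needs to be shows up as soon as something follows the jar name on the same line. The sketch below uses a made-up line (the CP= entry is hypothetical, not from the question's file) to compare the greedy pattern with the one restricted to digits and dots.

```shell
# Hypothetical line with a second jar after the one to be replaced.
line='CP=/x/log4j-core-2.10.0.jar:/y/other.jar'

# Greedy .* runs to the last ".jar" on the line and swallows the second path.
greedy=$(printf '%s\n' "$line" |
  sed 's/log4j-core-.*\.jar/log4j-core-2.17.1.jar/')

# Restricting the version to digits and dots stops at the first ".jar".
strict=$(printf '%s\n' "$line" |
  sed 's/log4j-core-[0-9.]*\.jar/log4j-core-2.17.1.jar/')

printf 'greedy: %s\nstrict: %s\n' "$greedy" "$strict"
```

For the two-line sample in the question both patterns give the same result; the difference only matters if the real file contains lines like the hypothetical one.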
sed -i 's#2\.10\.0\.jar$#2.17.1.jar#' file
That seems to work.

Extract text after last delimiter and attach at end of line [Linux/Ubuntu]

I have a fasta file that looks like below:
>sequence_1_g1
ATTTCGGATAA
>sequence_2_g1
AGGCTCTAGGA
>sequence_2_g2
TGTTCTGAAAT
>sequence_2_g3
CACCTCGGAGT
>sequence_3_new_g1
GCGGATAAAGC
I'd like to extract only the number that comes after the last delimiter and attach it to the end of each header, so that the output would look like below:
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
I've never used linux before and so far I've only been able to find this command to separate the text that comes after the last delimiter: sed -E 's/.*_//' filename.fasta . Can anyone give suggestions on what commands I should look for in addition to get my desired output?
You may try this sed, which looks for > at the start of a line and, on a match, replaces the 1+ digits at the end with number_number (& in the replacement stands for the matched text):
sed -E '/^>/s/[0-9]+$/&_&/' file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC
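Because & repeats the whole match, the same command should also cope with multi-digit suffixes. A quick check on a made-up header (sequence_4_g12 is not part of the question's data):

```shell
# Hypothetical two-line FASTA record with a two-digit suffix.
input=$(printf '%s\n' '>sequence_4_g12' 'ACGTACGT')

# Only header lines (starting with >) are touched; & is the matched number.
result=$(printf '%s\n' "$input" | sed -E '/^>/s/[0-9]+$/&_&/')
printf '%s\n' "$result"
```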
1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any awk. Note that it appends only the last character of the header, so it assumes a single-digit number.
awk '/^>/{$0=$0 "_" substr($0,length($0))} 1' Input_file
2nd solution: Using GNU awk's match function with a regex and a capturing group, please try the following. The [^0-9] before the group keeps the greedy .* from swallowing leading digits of a multi-digit number.
awk 'match($0,/^>.*[^0-9]([0-9]+)$/,arr){$0=$0"_"arr[1]} 1' Input_file
3rd solution: Assuming the lines that start with > always contain _g before the number, we can simply use _g as the field separator.
awk -F'_g' '/^>/{$0=$0"_"$2} 1' Input_file
4th solution: If a perl one-liner is acceptable, you could use perl's capturing groups (which are filled when the regex matches).
perl -pe 's/^(>.*?)([0-9]+)$/${1}${2}_${2}/' Input_file
Using sed
$ sed -E 's/.*_.([0-9]+)/&_\1/' input_file
>sequence_1_g1_1
ATTTCGGATAA
>sequence_2_g1_1
AGGCTCTAGGA
>sequence_2_g2_2
TGTTCTGAAAT
>sequence_2_g3_3
CACCTCGGAGT
>sequence_3_new_g1_1
GCGGATAAAGC

How to get the rest of the Pattern using any linux command?

I am trying to update a file and do some transformation using any Linux tool.
For example, here I am trying with awk.
It would be great to know how to get the rest of the pattern.
awk -F '/' '{print $1"/raw"$2}' <<< "string1/string2/string3/string4/string5"
string1/rawstring2
Here I don't know how many "/" there are, and I want to get the output:
string1/rawstring2/string3/string4/string5
Something like
awk -F/ -v OFS=/ '{ $2 = "raw" $2 } 1' <<< "string1/string2/string3/string4/string5"
Just modify the desired field and print out the changed line. (You have to set OFS so that awk uses a slash instead of a space to separate fields on output; a pattern of 1 triggers the default action of printing $0. It's an idiom you'll see a lot with awk.)
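The effect of OFS is easy to see by running the same program with and without it. A small sketch on the string from the question (the variable names are only for illustration):

```shell
line='string1/string2/string3/string4/string5'

# Without OFS, assigning to $2 rebuilds the record with single spaces.
without_ofs=$(printf '%s\n' "$line" | awk -F/ '{ $2 = "raw" $2 } 1')

# With OFS=/ the rebuilt record keeps the slashes.
with_ofs=$(printf '%s\n' "$line" | awk -F/ -v OFS=/ '{ $2 = "raw" $2 } 1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```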
Also possible with sed:
sed -E 's|([^/]*/)|\1raw|' <<< "string1/string2/string3/string4/string5"
The \1 in the replacement string reproduces the part matched inside the parentheses, and raw is appended to it.
Equivalent to
sed 's|\([^/]*/\)|\1raw|' <<< "string1/string2/string3/string4/string5"

Find a line and modify it in a csv file given an input

I have a CSV file with a list of workers and I want to write a script that modifies their work group given their IDs. Lines in the CSV file look like this:
Before:
ID TAG GROUP
niub16677500;B00;AB0
After:
ID TAG GROUP
niub16677500;B00;BC0
How can I do this?
I'm working with the awk and sed commands, but I haven't got anything working so far.
With awk:
awk -F';' -v OFS=';' -v id="niub16677500" -v new_group="BC0" '{if($1==id)$3=new_group}1' input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
Redirect the output to a file and note that the csv header should use the same field separator as the body.
Explanations:
-F';' to have input field separator as ;
-v OFS=';' same for the output FS
-v id="niub16677500" -v new_group="BC0" define the variables that you are going to use in the awk commands
'{if($1==id)$3=new_group}1' when the first column is equal to the value contained in the variable id, overwrite the 3rd field; then print the line
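Since the question asks for a script that takes the ID as input, the awk command can be wrapped in a small shell function. This is only a sketch: the function name set_group and its argument order are made up here, and it assumes a semicolon-separated file with the ID in the first column and the group in the third.

```shell
# Hypothetical helper: print the CSV with the group of one worker replaced.
set_group() {
  # $1 = worker ID, $2 = new group, $3 = CSV file
  awk -F';' -v OFS=';' -v id="$1" -v new_group="$2" \
    '$1 == id { $3 = new_group } 1' "$3"
}

# Demo on a throwaway copy of the sample data.
csv=$(mktemp)
printf '%s\n' 'ID;TAG;GROUP' 'niub16677500;B00;AB0' > "$csv"
updated=$(set_group niub16677500 BC0 "$csv")
printf '%s\n' "$updated"
rm -f "$csv"
```

The function only prints the modified content; redirecting it to a new file is left to the caller.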
With sed:
id="niub16677500"; new_group="BC0"; sed "/^$id/s/;[^;]*$/;$new_group/" input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
You can either do an inline change using -i.bak option, or redirect the output to a file.
Explanations:
Store the values in 2 variables
/^$id/ when you reach a line that starts with the ID stored in the variable id, run the sed search and replace
s/;[^;]*$/;$new_group/ search and replace command that will replace the last field by the new value
Sed can do it,
echo 'niub16677500;B00;AB0' | sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/'
will replace the AB0 group in your example with BC0 by matching the user name, a semicolon, any 3 characters and another semicolon as the first group, and then the remaining 3 characters as the second. In the output it repeats the first group with \1 and adds BC0.
You can use :
sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/' <old_file >new_file
to make a new_file with this change.
https://www.grymoire.com/Unix/Sed.html is a great resource, you should take a look at it.

How to use grep and sed in order to replace the substring after searching some specific string?

I want to know how to use the 'grep' and 'sed' utilities, or something else, to replace a substring. I will explain what I want to do below.
We have the file 'test.txt' with the following string:
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='AA5', A6='keyword_A'
After searching 'keyword_A' using grep, I want to replace the value of A5 with other string, for example, "NEW".
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
I tried to use two commands like
grep keyword_A test.txt | sed -e 's/blabla/blabla/'
After trying everything I know, I gave up.
Please let me know the right solution.
First, you never need grep and sed. Sed has a full regular-expression search engine, so it is a superset of grep. This command will read test.txt, change the lines that you've indicated, and print the entire result on standard output:
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/g" < test.txt
If you want to store the results back into the file test.txt, use the -i (in-place editing) switch to sed:
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/g" -i.bak test.txt
If you want to select only the indicated lines, modify those, and print only those lines to standard out, use a combination of the p (print) command and the -n (no output) switch.
sed "/keyword_A/s/A5{ATTR}='[A-Z0-9]*'/A5{ATTR}='NEW'/gp" -n test.txt
Using grep+sed is always the wrong approach. Here's one way to do it with GNU awk:
$ awk '/keyword_A/{ $0=gensub(/(A5({[^}]+})?=\047)[^\047]+/,"\\1NEW",1) } 1' file
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
Using a couple variables you could define the keyword and replacement ( if they change at all ):
q="keyword_A"
r="NEW"
Then with sed:
sed -r "s/^(.+\{.+\}=')(.+)('.+"${q}".+)$/\1"${r}"\3/" file
Result:
A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='NEW', A6='keyword_A'
A5="NEW"
A6="keyword_A"
# with sed
sed "s/='[^']*\(',[[:blank:]]*A6='${A6}'\)/='${A5}\1/" YourFile
# with awk
awk -F "'" -v A5="${A5}" -v A6="${A6}" '
BEGIN { OFS="\047" }
$12 == A6 { $10 = A5; $0 = $0 }
7
' YourFile
The sed version anchors the change on the A6 part at the end of the string; the awk version uses ' as the field separator instead of the traditional whitespace.
The awk version assumes there is no ' inside a value (otherwise the escaping would have to be handled).
We can directly replace the fifth column when the string keyword_A is found, as shown below (\047 is the single quote; the leading space keeps the original spacing after the comma):
awk -F, 'BEGIN{OFS=","} /keyword_A/{$5=" A5{ATTR}=\047NEW\047"} 1' filename
Couple of slight alternatives:
sed -r "/keyword_A/s/(A5[^']*')[^']*/\1NEW/"
awk -F"'" '/keyword_A/{$10 = "NEW"}1' OFS="'"
Of course, the drawback with awk is that you have to write the output to a new file and rename it afterwards.
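One common way to handle that is to write to a temporary file and move it over the original only if awk succeeded. A sketch under the assumption that mktemp is available; the file used here is a throwaway copy of the sample line, not the real test.txt.

```shell
# Throwaway file holding the sample line from the question.
file=$(mktemp)
printf '%s\n' "A1='AA1', A2='AA2', A3='AA3', A4='AA4', A5{ATTR}='AA5', A6='keyword_A'" > "$file"

# Write the awk output to a temporary file, then replace the original
# only if awk exited successfully.
tmp=$(mktemp)
awk -F"'" -v OFS="'" '/keyword_A/ { $10 = "NEW" } 1' "$file" > "$tmp" &&
  mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
rm -f "$file"
```

Note that the replaced file takes on the permissions of the temporary file, so they may need to be restored on a real file.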

Resources