I have a shell script that splits XML files, but there are one million XML files in the customer environment and the script runs slowly. Could it run in a multithreaded mode?
Thanks!
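My shell script:
#!/bin/bash
# Arrays are a bash feature, so this needs bash rather than plain sh.
File=/home/spark/PktLog
count=0
# Line numbers where each XML declaration (start of a document) appears
startLine=(`sed -n -e '/?xml version="1.0" encoding/=' $File`)
# Total number of lines in the file
fileEnd=`sed -n '$=' $File`
# Each document ends one line before the next one starts; the last ends at EOF
endLine=(`echo ${startLine[*]} | awk -v a=$fileEnd '{for(i=2;i<=NF;i++) printf("%d ",$i-1);print a}'`)
let maxIndex=${#startLine[@]}-1
for n in `seq 0 $maxIndex`
do
sed -n "${startLine[$n]},${endLine[$n]}p" $File >result_${n}.xml
done
echo ${startLine[@]}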
Your method is very slow because it reads the input file many times.
Instead of trying to make it faster with multithreading, you should rewrite the script to only read the input file one time.
Here is an example input file:
$ cat testfile
<?xml version="1.0" encoding="UTF-8"?>
<test>
<some data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more />
<data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more type="data" />
</test>
Here is an awk command that reads the file one time, and writes each document to a separate file:
$ awk 'BEGIN { file="/dev/null"; n=0; }
/xml version="1.0" encoding/ {
close(file);
file="file" ++n ".xml";
}
{print > file;}' testfile
Here is the result:
$ cat file1.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<some data />
</test>
$ cat file2.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more />
<data />
</test>
This is much faster:
$ grep -c 'xml version' PktLog
3000
$ time ./yourscript
real 0m9.791s
user 0m6.849s
sys 0m2.660s
$ time ./thisscript
real 0m0.248s
user 0m0.130s
sys 0m0.107s
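The single-pass idea can also be wrapped in a small helper that keeps the result_N.xml naming from the question. This is only a sketch: the function name split_xml and the output-directory argument are assumptions made for illustration, and it relies on every document starting with the same XML declaration line.

```shell
split_xml() {
  # $1 = concatenated input file, $2 = output directory (hypothetical arguments)
  awk -v dir="$2" '
    # A new XML declaration starts a new output file, numbered from 0.
    /<\?xml version="1.0" encoding/ {
      if (out != "") close(out)   # avoid running out of open file handles
      out = sprintf("%s/result_%d.xml", dir, n++)
    }
    # Lines before the first declaration are skipped.
    out != "" { print > out }
  ' "$1"
}

# Small demonstration on a throwaway file.
workdir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' '<test>' '</test>' \
  '<?xml version="1.0" encoding="UTF-8"?>' '<test>' '<more />' '</test>' > "$workdir/PktLog"
split_xml "$workdir/PktLog" "$workdir"
ls "$workdir"
```

Closing each file before opening the next matters with a very large number of documents, because awk keeps every output file open otherwise.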
I have an XML file and want to replace the text inside the <jdbcurl> tag with another value, but there are two <jdbcurl> tags, nested in pools with different ids.
Can anyone show me how to do this with sed?
Thanks.
<?xml version="1.0" ?>
<WEBServer fileName="webdb.xml" name="Configuration and Security File">
<security>
<pool id="DEFAULT" jndiName="jdbc/webdb">
<dbschema></dbschema>
<userID>DBUSER</userID>
<password>passwd1</password>
<jdbcdriver>oracle.jdbc.driver.OracleDriver</jdbcdriver>
<jdbcurl>jdbc:oracle:thin:@db.server.com:1753/ORCSN</jdbcurl>
</pool>
<pool id="bi_id" jndiName="jdbc/bidb">
<dbschema></dbschema>
<userID>BIUSER</userID>
<password>passwd2</password>
<jdbcdriver>oracle.jdbc.driver.OracleDriver</jdbcdriver>
<jdbcurl>jdbc:oracle:thin:@db.server.com:1753/ORCSN</jdbcurl>
</pool>
</security>
</WEBServer>
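sed -E '/bi_id/,/pool/ s/jdbc:[^<]*/you will replace/g' filename
This replaces the jdbcurl value in the pool with id="bi_id". The range runs from the line containing bi_id to the next line containing pool, which is the closing </pool> tag.
sed -E '/DEFAULT/,/pool/ s/jdbc:[^<]*/you will replace/g' filename
This does the same for the DEFAULT pool's jdbcurl.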
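The same range trick can be parameterised. The sketch below assumes a hypothetical helper named set_jdbcurl, one tag per line as in the sample file, and a new URL that contains no |, & or backslash characters (they would be interpreted by sed).

```shell
set_jdbcurl() {
  # $1 = pool id, $2 = new URL, $3 = file (hypothetical helper arguments)
  # Restrict the substitution to the lines from the matching <pool ...> to its </pool>.
  sed "/<pool id=\"$1\"/,/<\/pool>/ s|<jdbcurl>[^<]*</jdbcurl>|<jdbcurl>$2</jdbcurl>|" "$3"
}

# Demonstration on a cut-down copy of the sample.
sample=$(mktemp)
printf '%s\n' '<pool id="DEFAULT" jndiName="jdbc/webdb">' \
  '<jdbcurl>jdbc:oracle:thin:@db.server.com:1753/ORCSN</jdbcurl>' '</pool>' \
  '<pool id="bi_id" jndiName="jdbc/bidb">' \
  '<jdbcurl>jdbc:oracle:thin:@db.server.com:1753/ORCSN</jdbcurl>' '</pool>' > "$sample"
set_jdbcurl bi_id 'jdbc:oracle:thin:@other.server.com:1521/NEW' "$sample"
```

Matching on the full <pool id="..."> opening tag rather than on the bare id makes it less likely that an unrelated line starts the range.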
With xmlstarlet:
xmlstarlet edit --update '//WEBServer/security/pool[@id="DEFAULT"]/jdbcurl' --value 'XYZ' file.xml
Output:
<?xml version="1.0"?>
<WEBServer fileName="webdb.xml" name="Configuration and Security File">
<security>
<pool id="DEFAULT" jndiName="jdbc/webdb">
<dbschema/>
<userID>DBUSER</userID>
<password>passwd1</password>
<jdbcdriver>oracle.jdbc.driver.OracleDriver</jdbcdriver>
<jdbcurl>XYZ</jdbcurl>
</pool>
<pool id="bi_id" jndiName="jdbc/bidb">
<dbschema/>
<userID>BIUSER</userID>
<password>passwd2</password>
<jdbcdriver>oracle.jdbc.driver.OracleDriver</jdbcdriver>
<jdbcurl>jdbc:oracle:thin:@db.server.com:1753/ORCSN</jdbcurl>
</pool>
</security>
</WEBServer>
<scene name="scene_1_Overview" title="1 Overview" onstart="" thumburl="panos/1_Overview.tiles/thumb.jpg" lat="" lng="" heading="">
abc
</scene>
<scene name="scene_1_Overview" title="10 Overview" onstart="" thumburl="panos/1_Overview.tiles/thumb.jpg" lat="" lng="" heading="">
abc
</scene>
<scene name="scene_10_Room_Balcony_View" title="2 Room Balcony View" onstart="" thumburl="panos/10_Room_Balcony_View.tiles/thumb.jpg" lat="" lng="" heading="">
abc
def
</scene>
Say I have an XML file like the one above.
I need to sort the three elements by the number that follows title=, which here are 1, 10 and 2.
I'm considering using a bash script to do this.
I can use something like awk '{print $3}' test | awk -F "\"" '{print $2}' to get the three numbers, but I don't know how to read the multiple lines from each <scene to </scene>, put them in order, and write them back.
I think doing this in awk is not the greatest idea, but I know what it's like being stuck on a box where you lack access to install anything. If you are stuck with it then something like the following awk script should get you in the ballpark.
awk -F"[\" ]" '$0~/title/{title=$6} {scene[title]=scene[title]$0"\n"} END{PROCINFO["sorted_in"]="@ind_num_asc"; for (title in scene) {print scene[title]}}' inFile
Here awk is:
Splitting each line by either " or a space (-F"[\" ]")
If the line contains the word "title" ($0~/title/), then it sets the variable title to whatever it finds in field 6 (title=$6). The field number will change if your "name" contains spaces, since we are splitting on spaces, so you might have to monkey with the delimiters.
Next it stores the contents of the line, followed by a linefeed, into the array scene at the index set by the number stored in title ({scene[title]=scene[title]$0"\n"})
Once it's done processing the file it sets PROCINFO["sorted_in"] (a GNU awk feature) to @ind_num_asc, which tells awk to loop through arrays in index order while forcing the index to act as a number (END{PROCINFO["sorted_in"]="@ind_num_asc")
Then we loop through the array and print each element (for (title in scene) {print scene[title]})
Minimized a bit:
awk -F"[\" ]" '$0~/title/{t=$6}{s[t]=s[t]$0"\n"}END{PROCINFO["sorted_in"]="@ind_num_asc";for(t in s)print s[t]}' inFile
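If GNU awk is not available, a similar result can be sketched with POSIX awk plus sort. The helper name sort_scenes is made up for this example; it assumes each block starts on a line containing "<scene " with a title that begins with digits, ends on a line containing </scene>, and that lines outside those blocks can be dropped.

```shell
sort_scenes() {
  # $1 = input file (hypothetical helper)
  awk '
    # Opening line: remember the leading number of the title as the sort key.
    /<scene / {
      match($0, /title="[0-9]+/)
      key = substr($0, RSTART + 7, RLENGTH - 7)
      buf = ""
    }
    # Join the lines of one block with a \001 byte so sort sees a single record.
    { buf = (buf == "" ? $0 : buf "\001" $0) }
    /<\/scene>/ { print key "\t" buf; buf = "" }
  ' "$1" | sort -n | cut -f2- | tr '\001' '\n'
}

# Demonstration with the titles from the question in unsorted order.
scenes=$(mktemp)
printf '%s\n' '<scene name="a" title="10 Overview">' 'abc' '</scene>' \
  '<scene name="b" title="2 Room Balcony View">' 'abc' 'def' '</scene>' \
  '<scene name="c" title="1 Overview">' 'abc' '</scene>' > "$scenes"
sort_scenes "$scenes"
```

Flattening each block to one line lets the ordinary sort -n do the numeric ordering, and tr restores the original line breaks afterwards.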
Using xsltproc
$ xsltproc sort.xslt scenes.xml
sort.xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*" />
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()">
<xsl:sort select="scene/@title" />
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Otherwise the following perl one-liner
perl -e '
undef$/;
print "$_\n" for sort{
($a=~/title="(.*?)"/ms)[0] cmp ($b=~/title="(.*?)"/ms)[0]
}<>=~/<scene[ >].*?<\/scene>/gms
' scenes.xml
Could anyone please help me with the Groovy code for this requirement? I have an XML input such as the one below:
<?xml version="1.0" encoding="UTF-8"?>
<result>
<records>
<dataProcessed>
<FieldName>Tesco</FieldName>
<Mode>As Is</Mode>
</dataProcessed>
<dataProcessed>
<FieldName>ASDA|Tesco|Walmart</FieldName>
<Mode>Split</Mode>
</dataProcessed>
</records>
<records>
<dataProcessed>
<FieldName>Orange|MTS</FieldName>
<Mode>Break</Mode>
</dataProcessed>
</records>
</result>
When the value of the Mode field is either Split or Break, I need to split the segment on the pipe delimiter and change the value of Mode to 1, 2, etc. according to the position after splitting.
The expected output is:
<?xml version="1.0" encoding="UTF-8"?>
<result>
<records>
<dataProcessed>
<FieldName>Tesco</FieldName>
<Mode>As Is</Mode>
</dataProcessed>
<dataProcessed>
<FieldName>ASDA</FieldName>
<Mode>1</Mode>
</dataProcessed>
<dataProcessed>
<FieldName>Tesco</FieldName>
<Mode>2</Mode>
</dataProcessed>
<dataProcessed>
<FieldName>Walmart</FieldName>
<Mode>3</Mode>
</dataProcessed>
</records>
<records>
<dataProcessed>
<FieldName>Orange</FieldName>
<Mode>1</Mode>
</dataProcessed>
<dataProcessed>
<FieldName>MTS</FieldName>
<Mode>2</Mode>
</dataProcessed>
</records>
</result>
Loop through the dataProcessed nodes and then for each, check the value of Mode and act accordingly on the nodes.
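The loop described above can be illustrated with a line-oriented awk sketch. This is not Groovy and not a real XML parser: it assumes the exact one-element-per-line layout of the sample, and the helper name split_modes is invented for the example. The same control flow (collect FieldName and Mode per dataProcessed node, then emit one node per piece) carries over to a Groovy implementation.

```shell
split_modes() {
  # $1 = input file (hypothetical helper)
  awk '
    function emit(field, mode) {
      print "<dataProcessed>"
      print "<FieldName>" field "</FieldName>"
      print "<Mode>" mode "</Mode>"
      print "</dataProcessed>"
    }
    # Start of a node: reset what we know about it and hold back its output.
    /<dataProcessed>/ { name = ""; mode = ""; next }
    /<FieldName>/ { name = $0; gsub(/.*<FieldName>|<\/FieldName>.*/, "", name); next }
    /<Mode>/      { mode = $0; gsub(/.*<Mode>|<\/Mode>.*/, "", mode); next }
    # End of a node: either split on the pipe and number the pieces, or copy as is.
    /<\/dataProcessed>/ {
      if (mode == "Split" || mode == "Break") {
        n = split(name, part, "[|]")
        for (i = 1; i <= n; i++) emit(part[i], i)
      } else {
        emit(name, mode)
      }
      next
    }
    # Everything else (result, records, XML declaration) passes through unchanged.
    { print }
  ' "$1"
}

# Demonstration on a cut-down input.
records=$(mktemp)
printf '%s\n' '<records>' '<dataProcessed>' '<FieldName>Orange|MTS</FieldName>' \
  '<Mode>Break</Mode>' '</dataProcessed>' '</records>' > "$records"
split_modes "$records"
```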
UPDATE
This is my file:
<department name="/fighters" id="123879" group="channel" case="none" use="no">
<options index_name="index.html" listing="0" sum="no" allowed="no" />
<target prefix="ttp" suffix=".net" />
<type="effort">
<region="20491" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20491-set-writable" />
</replicate>
<region="20576" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20576-set-writable" />
</replicate>
</replication>
<user="T:106603" />
<user="T:123879" />
<user="test" />
<user="ele::123456" />
<user="company-temp" />
<user="companymw2" />
<user="bird" />
<user="coding11" />
<user="plazamedia" />
<allow go="123456=abcdefghijklmnopqrstuvwxyz" />
</department>
I wrote a bash command like:
awk < test.xml -Fuser= '{ print $2 }' | sed '/^$/d' | cut -d" " -f1
and result is something like:
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
But suppose the result were:
"T:106603" />
"T:123879" />
"test" />
"ele::123456" />
"company-temp" />
"companymw2" />
"bird" />
"coding11" />
"plazamedia" />
First, how can I remove everything after the second "?
Second, how can I extract everything between the two " characters?
I would like to do it with sed or awk.
Thank you in advance.
Try this:
awk -F'"' '/<user=/{ print $2 }' file
Using only sed:
$ sed 's/^<user=\(.*"\).*/\1/' test.xml # With quotes
$ sed 's/^<user="\(.*\)".*/\1/' test.xml # Without quotes
Try this cut,
cut -d'"' -f 2 test.xml
Try this sed,
With quotes("):
sed 's/^.*\("[^"]\+"\).*/\1/g' test.xml
Without quotes("):
sed 's/^.*"\([^"]\+\)".*/\1/g' test.xml
UPDATE:
sed -e '/^<user/!{d}' -e '/^<user/s/^.*"\([^"]\+\)".*/\1/' test.xml
If you want to get rid of the sed and cut in the pipeline, there are many ways to do that, depending on what the corner cases are. The simplest to me would seem to be
awk -F'"' '/<user=/ { print "\"" $2 "\"" }' test.xml
As usual, here's the obligatory don't parse XML with regex link.
Slightly interesting corner cases would be if there can be quoted double quotes in the string (but usually XML would use entities instead) or if the elements can have multiple attributes. If there could be multiple <user=...> elements on a single line, this will quickly become more complex than the proper solution, which is to use XSLT.
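For the multiple-elements-per-line corner case, a loop over match() is one possible sketch in plain awk. The helper name extract_users is hypothetical, and it still assumes values never contain a literal double quote.

```shell
extract_users() {
  # $1 = input file (hypothetical helper); prints each user value without quotes
  awk '{
    line = $0
    # Consume the line one <user="..."> occurrence at a time.
    while (match(line, /<user="[^"]*"/)) {
      # Skip the 7 characters of <user=" and drop the closing quote.
      print substr(line, RSTART + 7, RLENGTH - 8)
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$1"
}

# Demonstration with two elements on one line.
users=$(mktemp)
printf '%s\n' '<user="T:106603" /><user="test" />' '<target prefix="ttp" suffix=".net" />' \
  '<user="ele::123456" />' > "$users"
extract_users "$users"
```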
Try :
$ awk '/<user=/ && gsub(/<user=|\/>/,x)' file
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
Using gnu grep
grep -Po 'user=\K"[^"]*"' file
I have the below snippet of XML from my code base:
<property name="myData">
<map>
<entry key="/mycompany/abc">
<value>Mike</value>
</entry>
<entry key="/mycompany/pqr">
<value>John</value>
</entry>
<entry key="/mycompany/xyz">
<value>Sara</value>
</entry>
</map>
</property>
The above snippet is just a portion of the XML file. I have an existing shell script that replaces some of the data in this file.
Now I need to modify that shell script to comment out the section as shown below:
<!-- entry key="/mycompany/abc">
<value>Mike</value>
</entry>
<entry key="/mycompany/pqr">
<value>John</value>
</entry -->
Is it possible to comment out the above two entries via a shell script? I can replace the opening <entry key="/mycompany/abc"> with <!-- entry key="/mycompany/abc"> since there is only one such occurrence, but I am not able to replace the closing </entry> tag of the /mycompany/pqr node, because every occurrence would be replaced if I tried to replace it with </entry -->
Any idea on how to replace this closing node in shell script?
Thanks!
Using an xslt stylesheet like this:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output omit-xml-declaration="no"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/property/map/entry[@key='/mycompany/abc']"/>
<xsl:template match="/property/map/entry[@key='/mycompany/pqr']"/>
</xsl:stylesheet>
Then using the xsltproc xsl processor via the shell script:
$ xsltproc fix.xslt document.xml
which will give you:
<?xml version="1.0"?>
<property name="myData">
<map>
<entry key="/mycompany/xyz">
<value>Sara</value>
</entry>
</map>
</property>
If you really need those nodes commented out then my xslt-foo is not strong enough - you'll probably need <xsl:comment>.
EDIT: A solution with awk:
awk '/<entry key="\/mycompany\/(abc|pqr)">/,/<\/entry>/ {p=1}; /.*/{ if(p==0) {print;}; p=0 }' blah.xml
Result:
<property name="myData">
<map>
<entry key="/mycompany/xyz">
<value>Sara</value>
</entry>
</map>
</property>
Please note that the awk version will not work correctly with nested tags.
-nick
Disclaimer: I think of using awk/sed/... for XML files as a bad idea; if the formatting changes, the line-number between your tags differ, you end up with a bung XML file.
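# Requires GNU awk (gensub). The comment opens at the abc entry and closes
# 5 lines later, on the </entry> line of the pqr entry.
BEGIN{
count=-6
}
{
# Any line that is neither the abc opening tag nor the closing line is copied as is
if( $0 !~ /\/mycompany\/abc/ && NR != count+5){
print $0
next
}
if( $0 ~ /\/mycompany\/abc/) {
count=NR;
# <entry ...> becomes <!-- entry ...>
print gensub( /(entry)/, "!-- \\1", "1" )
}else{
# </entry> becomes </entry -->
print gensub( /(entry)/, "\\1 --", "1" )
}
}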
Save as "something.awk", run like so:
awk -f something.awk your_file.xml
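A variant that does not depend on the line count between the tags could look like the sketch below. It uses plain awk sub() instead of gensub; the helper name comment_entries and its arguments are assumptions for illustration, and it still expects each tag on its own line with no nested entry elements.

```shell
comment_entries() {
  # $1 = key of the first entry, $2 = key of the last entry, $3 = file
  # (hypothetical helper; comments out everything from the first to the last entry)
  awk -v first="$1" -v last="$2" '
    # Opening tag of the first entry: start the comment here.
    index($0, "<entry key=\"" first "\">") { sub(/<entry/, "<!-- entry"); active = 1 }
    # Opening tag of the last entry: its </entry> is the one to close on.
    index($0, "<entry key=\"" last "\">") { inlast = 1 }
    active && inlast && /<\/entry>/ {
      sub(/<\/entry>/, "</entry -->")
      active = 0; inlast = 0
    }
    { print }
  ' "$3"
}

# Demonstration on the entries from the question.
entries=$(mktemp)
printf '%s\n' '<entry key="/mycompany/abc">' '<value>Mike</value>' '</entry>' \
  '<entry key="/mycompany/pqr">' '<value>John</value>' '</entry>' \
  '<entry key="/mycompany/xyz">' '<value>Sara</value>' '</entry>' > "$entries"
comment_entries /mycompany/abc /mycompany/pqr "$entries"
```

Using index() for the key comparison avoids having to escape the slashes in the key as a regular expression.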