parsing and replacing some strings in two files - linux

I want to run a shell script with this usage:
./run A.txt B.xml
A.txt contains some statistics:
Accesses = 1
Hits = 2
Misses = 3
Evictions = 4
Retries = 5
B.xml looks like:
<stat name="total_accesses" value="0"/>
<stat name="total_misses" value="0"/>
<stat name="conflicts" value="0"/>
I want to replace some stats in B.xml from A.txt. For example, I want to
1- find "Accesses" in A.txt
2- find "total_accesses" in B.xml
3- replace 0 with 1
1- find "Misses" in A.txt
2- find "total_misses" in B.xml
3- replace 0 with 3
So B.xml will look like:
<stat name="total_accesses" value="1"/>
<stat name="total_misses" value="3"/>
<stat name="conflicts" value="0"/>
I want to do that with the shell "sed" command. However, I find it quite complex, as the regexp is hard to understand.
Does "sed" help me with this problem, or do I have to find another way?

It might be a bit heavy-weight for such a simple case, but here's a Python script that does the job:
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as etree

# read A.txt; fill stats
stats = {}
for line in open(sys.argv[1]):
    if line.strip():
        name, _, count = line.partition('=')
        stats["total_" + name.lower().strip()] = count.strip()

# read B.xml; fix to make it a valid xml; replace stat[#value]
root = etree.fromstring("<root>%s</root>" % open(sys.argv[2]).read())
for s in root:
    if s.get('name') in stats:
        s.set('value', stats[s.get('name')])
    print etree.tostring(s),
Example
$ python fill-xml-template.py A.txt B.xml
<stat name="total_accesses" value="1" />
<stat name="total_misses" value="3" />
<stat name="conflicts" value="0" />
To process input files incrementally or to make changes in place, you could use the following:
#!/usr/bin/env python
import fileinput
import sys
import xml.etree.ElementTree as etree

try: sys.argv.remove('-i')
except ValueError:
    inplace = False
else: inplace = True  # make changes inplace if `-i` option is specified

# read A.txt; fill stats
stats = {}
for line in open(sys.argv.pop(1)):
    if line.strip():
        name, _, count = line.partition('=')
        stats["total_" + name.lower().strip()] = count.strip()

# read input; replace stat[#value]
for line in fileinput.input(inplace=inplace):
    s = etree.fromstring(line)
    if s.get('name') in stats:
        s.set('value', stats[s.get('name')])
    print etree.tostring(s)
Example
$ python fill-xml-template.py A.txt B.xml -i
It can read from stdin or process several files:
$ cat B.xml | python fill-xml-template.py A.txt
<stat name="total_accesses" value="1" />
<stat name="total_misses" value="3" />
<stat name="conflicts" value="0" />

Here is a shell script that does what you want:
#!/bin/bash
while read line
do
    key=`echo $line | cut -d' ' -f1`
    value=`echo $line | cut -d' ' -f3`
    xmlLine=`grep -i $key $2`
    if [ -n "$xmlLine" ]; then
        for num in `seq 5`
        do
            field[${num}]=`echo "$xmlLine" | cut -d'"' -f${num}`
        done
        echo ${field[1]}\"${field[2]}\"${field[3]}\"$value\"${field[5]}
    fi
done < "$1"
You can copy it to a file, say A.sh, give it run permissions (chmod +x A.sh), and then run:
./A.sh A.txt B.xml
Please note that this code is not suitable for production, and that regular expressions really are essential for scripts like these.
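If you do want to stay in the shell, here is a pure awk sketch (untested beyond the sample data above): it builds a name-to-value map from A.txt, assuming the A.txt names map to the XML names simply by lower-casing and prefixing total_ (true for Accesses and Misses), then rewrites the matching value="..." attributes and writes the result to a new file:
awk -F' *= *' '
    NR == FNR { map["total_" tolower($1)] = $2; next }   # A.txt: build name -> value map
    {
        for (k in map)
            if ($0 ~ ("name=\"" k "\""))
                sub(/value="[^"]*"/, "value=\"" map[k] "\"")
        print
    }
' A.txt B.xml > B.new.xml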

While you can hack this on the command line, I'd recommend against it.
XML is far too fragile to be handled this way - use a proper XML library and parse the XML before manipulating it; otherwise you could easily end up with broken XML. For example, write a script in Ruby, Python, or Perl and use an XML library.
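If you would rather stay on the command line, xmlstarlet is one tool that still goes through a real XML parser. A sketch only, assuming B.xml is (or is wrapped to be) a well-formed document with a single root element around the <stat> elements:
xmlstarlet ed \
    -u '//stat[@name="total_accesses"]/@value' -v 1 \
    -u '//stat[@name="total_misses"]/@value' -v 3 \
    B.xml > B.new.xml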

Related

How to split a delimited string to compose a dd command in bash?

I would like to read a config file that should look similar to what is shown below:
source/path:blocksize,offset,seek,count
source/path2:blocksize,offset,seek
source/path3:blocksize,offset
Where source/path, source/path2 and source/path3 are paths to some binary file, and offset, seek, count and blocksize are the respective values for the dd command.
Note that the fields may vary; for example, some binary file may not have a seek value, or may be missing both the seek and count values for the dd command.
How should I split the above lines to compose dd commands like the following?
dd if=${source/path} bs=${blocksize} seek=${seek} count=${count}
dd if=${source/path} bs=${blocksize} seek=${seek}
dd if=${source/path} bs=${blocksize}
It is OK if the above format needs to be modified to make it easier to parse; I have run out of ideas.
Hope this helps:
$ cat <<EOF | while read line; do arr=($(sed 's/[,:]/ /g' <<< $line)); echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"; done
source/path:blocksize,offset,seek,count
source/path2:blocksize,offset,seek
source/path3:blocksize,offset
EOF
source:source/path block:blocksize offset:offset seek:seek count:count
source:source/path2 block:blocksize offset:offset seek:seek count:
source:source/path3 block:blocksize offset:offset seek: count:
General Idea:
#!/usr/bin/env bash
your_command | while read line; do
    arr=($(sed 's/[,:]/ /g' <<< $line))
    echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"
    # Do whatever processing & validation you want here
    # access from array : ${arr[0]}....${arr[n]}
done
If you're reading from a file, then:
#!/usr/bin/env bash
while read line; do
    arr=($(sed 's/[,:]/ /g' <<< $line))
    echo "source:${arr[0]} block:${arr[1]} offset:${arr[2]} seek:${arr[3]} count:${arr[4]}"
    # Do whatever processing & validation you want here
    # access from array : ${arr[0]}....${arr[n]}
done < "path/to/your-file"
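The snippets above stop at splitting and printing the fields; composing the actual dd command from them could look like this. A hedged sketch only: config.txt is a placeholder file name, and mapping offset to dd's skip= operand is my assumption, since the question does not say which dd operand offset corresponds to:
#!/usr/bin/env bash
while IFS=: read -r src rest; do
    IFS=, read -r bs offset seek count <<< "$rest"
    cmd=(dd "if=$src" "bs=$bs")
    [ -n "$offset" ] && cmd+=("skip=$offset")   # assumption: offset -> dd's skip=
    [ -n "$seek" ]   && cmd+=("seek=$seek")
    [ -n "$count" ]  && cmd+=("count=$count")
    echo "${cmd[@]}"   # replace echo with "${cmd[@]}" to actually run dd
done < config.txt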

Replace \n with <br /> in bash

[UPDATED QUESTION]
I've got a variable $CHANGED which stores the output of a subversion command like this: CHANGED="$(svnlook changed -r $REV $REPOS)".
Executing svnlook changed -r $REV $REPOS will output the following to the command line:
A /path/to/file
A /path/to/file2
A /path/to/file3
However, I need to store the output formatted as shown below in a variable $FILES:
A /path/to/file<br />A /path/to/file2<br />A /path/to/file3<br />
I need this for using $FILES in a command which generates an email message like this:
sendemail [some-options] $FILES
It should replace $FILES with A /path/to/file<br />A /path/to/file2<br />A /path/to/file3<br /> so that the HTML break tags can be interpreted.
In bash:
echo "${VAR//$'\n'/<br />}"
See Parameter Expansion
The Parameter Expansion section of the man page is your friend.
Starting with
changed="
A /path/to/file
A /path/to/other/file
A /path/to/new/file
"
You can remove leading and trailing newlines using the # and % expansions:
files="${changed#$'\n'}"
files="${files%$'\n'}"
Then replace the other newlines with <br />:
files="${files//$'\n'/<br />}"
Demonstration:
printf '***%s***\n' "$files"
***A /path/to/file<br />A /path/to/other/file<br />A /path/to/new/file***
(Note that I've changed your all-uppercase variable names to lower case. Avoid uppercase names for your locals, as these tend to be used for communication via the environment.)
If you dislike writing newline as $'\n', you may of course store it in a variable:
nl=$'\n'
files="${changed#$nl}"
files="${files%$nl}"
files="${files//$nl/<br />}"
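Putting the pieces together for the svnlook case in the question (a sketch only; REV and REPOS are assumed to be provided by the hook environment, and the sendemail options are left as in the question):
changed="$(svnlook changed -r "$REV" "$REPOS")"
files="${changed//$'\n'/<br />}<br />"   # embedded newlines -> <br />, plus the trailing one
# sendemail [some-options] "$files"      # fill in your actual sendemail options here
Note that the command substitution already strips the trailing newline from the svnlook output, so only the embedded newlines need replacing.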
You can modify hek2mgl's answer to strip out the first <br /> (if any):
CHANGED="
A /path/to/file
A /path/to/other/file
A /path/to/new/file
"
FILES="$(echo "${CHANGED//$'\n'/<br />}" | sed 's#^<br />##g')"
echo "$FILES"
Output:
A /path/to/file<br />A /path/to/other/file<br />A /path/to/new/file<br />
Another way (with only sed); here the :a;N;$!ba loop slurps the whole input into sed's pattern space so that the embedded newlines can be replaced:
FILES="$(echo "$CHANGED" | sed ':a;N;$!ba;s#\n#<br />#g;s#^<br />##g')"

sort fasta by sequence size

I currently want to sort a huge fasta file (more than 10**8 lines and sequences) by sequence size. FASTA is a clearly defined format in biology used to store sequences (nucleotide or protein):
>id1
sequence 1 # could be on several line
>id2
sequence 2
...
I have run a tool that gives me, in TSV format: the identifier, the length, and the position in bytes of the identifier.
For now, what I am doing is sorting this file by the length column, then parsing it and using seek to retrieve the corresponding sequence, and appending it to a new file.
# this function will get the sequence using seek
def get_seq(file, bites):
    with open(file) as f_:
        f_.seek(bites, 0)  # go to the line of interest
        line = f_.readline().strip()  # this line is the beginning of the sequence
        to_return = ""  # init the string which will contain the sequence
        while line and not line.startswith('>'):  # while we do not encounter another identifier (or EOF)
            to_return += line
            line = f_.readline().strip()
    return to_return

# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):
    with open(out_file, 'a') as out_file:
        out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop will parse the index file and call the functions defined above
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)
My problem is that this is really slow (it takes several days). Is that normal? Is there another way to do it? I am not a computer scientist by training, so I may be missing something, but I thought that indexing the file and using seek was the fastest way to achieve this. Am I wrong?
It seems like opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like Biopython or samtools. Here's an (untested) solution with samtools; the per-region output is appended to args.out via a file handle, since shell redirection operators cannot be passed as list arguments to subprocess:
import subprocess

subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref, open(args.out, "a") as out:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)
What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use an index file, but it may nevertheless run faster.
csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'

for file in tmp_fasta_file_*
do
    TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
    FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done

for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
    cat "$filename" >> $2
done

rm tmp_fasta_file*
First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta
Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r option to fastq-sort.
Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.
You can also do it very conveniently with awk, check the code below:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"
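To see what the first two awk stages produce, here is a small illustration using the toy records from the question (both toy sequences happen to be 10 characters long; expected output shown as comments):
printf '>id1\nsequence 1\n>id2\nsequence 2\n' |
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}'
# 10    >id1    sequence 1
# 10    >id2    sequence 2
The final sort then orders these one-record-per-line entries numerically by the length column before they are unfolded back into fasta.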
This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend that community for questions like this one.

how to delete line after specific pattern and extract something

UPDATE
This is my file:
<department name="/fighters" id="123879" group="channel" case="none" use="no">
<options index_name="index.html" listing="0" sum="no" allowed="no" />
<target prefix="ttp" suffix=".net" />
<type="effort">
<region="20491" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20491-set-writable" />
</replicate>
<region="20576" readonly="fs1a" readwrite="fs1a" upload="yes" download="yes" repl="yes" hard="0" soft"0" prio="0" write="no" stage="yes" migrate="no" size="0" >
<read="content" readwrite="content" hard="215822106624" soft="237296943104" prio="5" write="yes" stage="yes" migrate="no" size="0" />
<overflow name="20576-set-writable" />
</replicate>
</replication>
<user="T:106603" />
<user="T:123879" />
<user="test" />
<user="ele::123456" />
<user="company-temp" />
<user="companymw2" />
<user="bird" />
<user="coding11" />
<user="plazamedia" />
<allow go="123456=abcdefghijklmnopqrstuvwxyz" />
</department>
I wrote a bash command like:
awk < test.xml -Fuser= '{ print $2 }' | sed '/^$/d' | cut -d" " -f1
and the result is something like:
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
But imagine the result is:
"T:106603" />
"T:123879" />
"test" />
"ele::123456" />
"company-temp" />
"companymw2" />
"bird" />
"coding11" />
"plazamedia" />
First, how can I remove everything after the second "?
Secondly, how can I extract everything between the " "?
I would like to do it with sed or awk.
Thank you in advance.
Try this:
awk -F'"' '/<user=/{ print $2 }' file
Using only sed:
$ sed 's/^<user=\(.*"\).*/\1/' test.xml # With quotes
$ sed 's/^<user="\(.*\)".*/\1/' test.xml # Without quotes
Try this cut:
cut -d'"' -f 2 test.xml
Try this sed:
With quotes("):
sed 's/^.*\("[^"]\+"\).*/\1/g' test.xml
Without quotes("):
sed 's/^.*"\([^"]\+\)".*/\1/g' test.xml
UPDATE:
sed -e '/^<user/!{d}' -e '/^<user/s/^.*"\([^"]\+\)".*/\1/' test.xml
If you want to get rid of the sed and cut in the pipeline, there are many ways to do that, depending on what the corner cases are. The simplest to me would seem to be
awk -F'"' '/<user=/ { print "\"$2\"" }' test.xml
As usual, here's the obligatory don't parse XML with regex link.
Slightly interesting corner cases would be if there can be quoted double quotes in the string (but usually XML would use entities instead) or if the elements can have multiple attributes. If there could be multiple <user=...> elements on a single line, this will quickly become more complex than the proper solution, which is to use XSLT.
Try:
$ awk '/<user=/ && gsub(/<user=|\/>/,x)' file
"T:106603"
"T:123879"
"test"
"ele::123456"
"company-temp"
"companymw2"
"bird"
"coding11"
"plazamedia"
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
Using GNU grep:
grep -Po 'user=\K"[^"]*"' file
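Here -P enables Perl-compatible regular expressions and \K discards the already-matched user= prefix from the reported match, so only the quoted value is printed. Against the test.xml shown above, this gives:
"T:106603"
"T:123879"
"test"
...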

Searching for text

I'm trying to write a shell script that searches for text within a file and prints out the text and associated information to a separate file.
From this file containing a list of gene IDs:
DDIT3 ENSG00000175197
DNMT1 ENSG00000129757
DYRK1B ENSG00000105204
I want to search for these gene IDs (ENSG*) and their RPKM1 and RPKM2 values in a gtf file:
chr16 gencodeV7 gene 88772891 88781784 0.126744 + . gene_id "ENSG00000174177.7"; transcript_ids "ENST00000453996.1,ENST00000312060.4,ENST00000378384.3,"; RPKM1 "1.40735"; RPKM2 "1.61345"; iIDR "0.003";
chr11 gencodeV7 gene 55850277 55851215 0.000000 + . gene_id "ENSG00000225538.1"; transcript_ids "ENST00000425977.1,"; RPKM1 "0"; RPKM2 "0"; iIDR "NA";
and print/write them to a separate output file:
Gene_ID RPKM1 RPKM2
ENSG00000108270 7.81399 8.149
ENSG00000101126 12.0082 8.55263
I've done it on the command line for each ID using:
grep -w "ENSGno" rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' > output.file
but when it comes to writing the shell script, I've tried various combinations of for, while, read, do and changing the variables but without success. Any ideas would be great!
You can do something like:
while read line
do
    var=$(echo $line | awk '{print $2}')
    grep -w "$var" rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.file
done < geneIDs.file
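That works, but it rescans rnaseq.gtf once per gene ID. A single-pass alternative (a hedged sketch, assuming the same file names and that the ENSG IDs only occur in the gene_id field) loads the IDs first and then scans the gtf once:
awk 'NR == FNR { ids[$2]; next }   # first file: remember the ENSG IDs
     { for (id in ids) if (index($0, id)) { print $10, $13, $14, $15, $16; break } }' \
    geneIDs.file rnaseq.gtf > output.file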
