sort fasta by sequence size - python-3.x

I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic):
>id1
sequence 1 # could be on several line
>id2
sequence 2
...
I have run a tools that give me in tsv format:
the Identifiant, the length, and the position in bytes of the identifiant.
for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file.
# this fonction will get the sequence using seek
def get_seq(file, bites):
with open(file) as f_:
f_.seek(bites, 0) # go to the line of interest
line = f_.readline().strip() # this line is the begin of the
#sequence
to_return = "" # init the string which will contains the sequence
while not line.startswith('>') or not line: # while we do not
# encounter another identifiant
to_return += line
line = f_.readline().strip()
return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):
with open(out_file, 'a') as out_file:
out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))
# main loop will parse the index file and call the function defined below
with open(args.fai) as ref:
indice = 0
for line in ref:
spt = line.split()
id_ = spt[0]
seq = get_seq(args.i, int(spt[2]))
write_seq(out_file=args.out, id_=id_, sequence=seq)
my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?

Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref:
for line in ref:
spt = line.split()
id_ = spt[0]
subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)

What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster.
csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'
for file in tmp_fasta_file_*
do
TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
cat "$filename" >> $2
done
rm tmp_fasta_file*
First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta

Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "#"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r option to fastq-sort.
Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "#"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.

You can also do it very conveniently with awk, check the code below:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"
This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such like this one.

Related

Using grep -m to save X amount of lines into new zipped file

I have a file that has this pattern (the following text is equivalente to 1 sequence):
#A00479:60:HL5HKDSXX:1:1101:1759:1000 1:N:0:CAGCGTTA
TGAGCCACAGACCCTGGATCCCTCCCTGAGGTCCCATGGGACGGGCAGGCTGGGCATACCTGCAGAGAAGATGTGGCCAGCCACGGCCAGGAACGCATCGGTCACCACAGGCTCAGACTGCAGGGAGATGTGCAGCTGACGCGCCACGTTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I'd like to use grep to "pick" the first 100 sequences that have the pattern "#" and save that to a new zipped file
I was trying something like this
gzip | grep -m 10 # test_seq_R1.fasta | cat test_seq_R1.fasta > test_seq_R1_zipped
But it is basically returning the same content from the original file test_seq_R1.fasta.
How can I choose the first 100 sequences that initiate with the # pattern and zip it to a new file using grep and gzip?
Thank you
Suggesting an awk script:
awk 'count < 101 && /^#/ {++count; print}' input.txt

how to loop through string for patterns from linux shell?

I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.
My current script does:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'
The first line is generating an output like this:
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
And the objective is to get that into a list of single :tag:'s.
Again, the problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
**I want :tagOne:tagTwo:etcTag: to be turned this into:
:tagOne:
:tagTwo:
:etcTag:
and so forth with :politics:violence: etc.
Colons aren't necessary, tagOne is just as good (maybe better, but this is trivial) than :tagOne:.
The problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
So I should replace the sed with something better...
I've tried:
A smarter sed:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u
...which works (for a limited number of tags) except that it produces weird results like:
:toxicity:p:
:somewhat:y:
:people:n:
...placing weird random letters at the end of some tags in which :p: is the final character of the :leadership: tag and "leadership" no longer appears in the list. Same for :y: and :n:.
I've also tried using loops in a couple ways...
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u | grep lead
...which has the same problem of :leadership: tags being lost etc.
And like...
for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
printf "$t\n";
done
done | sort -u
...which doesn't separate the tags at all, just prints stuff like:
:truama:leadership:business:toxicity
Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in python (I have a few python scripts but don't know the language well, but maybe this would be easy to do that way)? Every time I see awk I think "EEK!" so I'd prefer a non-awk solution please, preferring to stick to paradigms I've used in order to learn them better.
Using PCRE in grep (where available) and positive lookbehind:
$ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:
You will lose the leading : but get the tags nevertheless.
Edit: Did someone mention awk?:
$ awk '{
while(match($0,/:[^:]+:/)) {
a[substr($0,RSTART,RLENGTH)]
$0=substr($0,RSTART+1)
}
}
END {
for(i in a)
print i
}' file
Another idea using awk ...
Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One awk idea:
awk '
{ split($0,tmp,":") # split input on colon;
# NOTE: fields #1 and #NF are the empty string - see END block
for ( x in tmp ) # loop through tmp[] indices
{ arr[tmp[x]] } # store tmp[] values as arr[] indices; this eliminates duplicates
}
END { delete arr[""] # remove the empty string from arr[]
for ( i in arr ) # loop through arr[] indices
{ printf ":%s:\n", i } # print each tag on separate line leading/trailing colons
}
' tags.raw | sort # sort final output
NOTE: I'm not up to speed on awk's ability to internally sort arrays (thus eliminating the external sort call) so open to suggestions (or someone can copy this answer to a new one and update with said ability?)
The above also generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
A pipe through tr can split those strings out to separate lines:
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
An approach with sed:
sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd
This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr:
sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
Using awk to work with the : separated fields, removing duplicates:
awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd
Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One while/for/printf idea based on associative arrays:
unset arr
typeset -A arr # declare array named 'arr' as associative
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
arr[${word}]=1 # create/overwrite arr[$word] with value 1;
# objective is to make sure we have a single entry in arr[] for $word;
# this eliminates duplicates
done
done < tags.raw
printf ":%s:\n" "${!arr[#]}" | sort # pass array indices (ie, our unique list of words) to printf;
# per OPs desired output we'll bracket each word with a pair of ':';
# then sort
Per OPs comment/question about removing the array, a twist on the above where we eliminate the array in favor of printing from the internal loop and then piping everything to sort -u:
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
printf ":%s:\n" "${word}" # print ${word} to stdout
done
done < tags.raw | sort -u # pipe all output (ie, list of ${word}s for sorting and removing dups
Both of the above generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:

Bash: Loop Read N lines at time from CSV

I have a csv file of 100000 ids
wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr
that are being transformed into a json object using jq
output=$(jq -Rsn '
{"id":
[inputs
| . / "\n"
| (.[] | select(length > 0) | . / ";") as $input
| $input[0]]}
' <$FILE)
output
{
"id": [
"wef7efwe1fwe8",
"wef7efwe1fwe3",
....
]
}
currently, I need to manually split the file into smaller 10000 line files... because the API call has a limit.
I would like a way to automatically loop through the large file... and only use 10000 lines as a time as $FILE... up until the end of the list.
I would use the split command and write a little shell script around it:
#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000
# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"
# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
jq -Rsn '
{"id":
[inputs
| . / "\n"
| (.[] | select(length > 0) | . / ";") as $input
| $input[0]]
}' "${split}" > api_payload.json
# now do something ...
# curl -dapi_payload.json http://...
rm -f api_payload.json
done
# Clean up
rm -rf "${temp_dir}"
Here's a simple and efficient solution that at its core just uses jq. It takes advantage of the -c command-line option. I've used xargs printf ... for illustration - mainly to show how easy it is to set up a shell pipeline.
< data.txt jq -Rnc '
def batch($n; stream):
def b: [limit($n; stream)]
| select(length > 0)
| (., b);
b;
{id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
Parameterizing batch size
It might make sense to set things up so that the batch size is specified outside the jq program. This could be done in numerous ways, e.g. by invoking jq along the lines of:
jq --argjson n 10000 ....
and of course using $n instead of 10000 in the jq program.
Why “def b:”?
For efficiency. jq’s TCO (tail recursion optimization) only works for arity-0 filters.
Note on -s
In the Q as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process input in a stream-oriented way (i.e. one line of input or one JSON entity at a time).

Interleave lines sorted by a column

(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.
I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as there aren't two of the same system twice in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
edit:
To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!
One of several possible sample outputs:
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L→ℕ0, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
Write down a sequence s of f(a) many a.
Add the remaining letters into a FIFO working list, that is:
(Don't add any a)
First add f(b) many b
Then f(c) many c
and so on
Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
...
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4.
s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
...
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.
Implementation In Bash
Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.
Well, this is what I've come up with:
args=()
while IFS=' ' read -r _ name; do
# add a file redirection with grepped certain SYSTEM only for later eval
args+=("<(grep '^$name,' file)")
done < <(
# extract SYSTEM only
<file cut -d, -f1 |
#sort with the count
sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[#]}" |
# paste will insert empty lines when the list ended - remove them
sed '/^$/d'
First, I extract and sort the SYSTEM names in the order which occurs the most often to be first. So for the input example we get:
4 beta
3 alpha
1 gamme
Then for each such name I add the proper string <(grep '...' file) to arguments list witch will be later evalulated.
Then I evalulate the call to paste <(grep ...) <(grep ...) <(grep ...) ... with newline as the paste's delimeter. I remove empty lines with simple sed call.
The output for the input provided:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
Converted to a fancy oneliner, with substituting the while read with command substitution and sed. Got safe with inputfile naming with printf "%q" "$inputfile" and double quoting inside sed regex.
inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
cut -d "$fieldsep" -f1 "$inputfile" |
sort | uniq -c | sort -nr |
sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
tr '\n' ' '
)" |
sed '/^$/d'
inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
}
To accommodate for strange input values this script implements somewhat simple but hardy in bash statemachine.
The variable counts stores SYSTEM names with their're occurrence count. So from the example input it will be
4 alpha
3 beta
1 gamma
Now - we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement it's occurrence count. If the count is equal to zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero, so the list is empty. For the example input this will output:
beta
alpha
beta
alpha
beta
alpha
beta
gamma
Now, we need to join that list with the job names. We can't use join as the input is not sorted and we don't want to change the ordering. So what I do, I get only SYSTEM names in system. Then for each system I open a different file descriptor with filtered only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then for each SYSTEM name from the input, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from the file descriptor. This works like an array of file positions each file position associated / filtering specified SYSTEM name.
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
The script was done so for the input in the form of:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script outputs correctly:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
I think this algorithm will mostly always print correct output, but the ordering is so that the least common SYSTEMs will be outputted last, which may be not optimal.
Tested manually with some custom tests and checker on paiza.io.
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
} |
{
# check if the output is correct
output=$(cat)
# output should have same lines as inputfile
if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
echo "Output does not match input!" >&2
exit 1
fi
# two consecutive lines can't have the same system
lastsys=""
<<<"$output" cut -d, -f1 |
while IFS= read -r sys; do
if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
echo "Same systems found on two consecutive lines!" >&2
exit 1
fi
lastsys="$sys"
done
# all ok
echo "all ok!"
echo -------------
printf "%s\n" "$output"
}
exit

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences but IDs have to be inserted separately to the code.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
Following awk may help you on same.
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, eg. DM_0000000004, using the scripts below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we can extract the lines between idx1 and idx2 minus 1, which are the title and the sequence, in which case you can use grep -A.
The advantage of this ugly-hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is this process is slow. For my 25G multi-fasta file, such extraction takes tens of seconds. However, it's much faster than using samtools faidx .

Resources