Interleave lines sorted by a column - linux
(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.
I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as the same system doesn't appear on two lines in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
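Roughly, the intermediate-file version I mean looks like this (just sketching it here; "input" stands for the real file):
for sys in $(cut -d, -f1 input | sort -u); do
grep "^$sys," input > "$sys.txt"
done
paste -d '\n' *.txt | sed '/^$/d'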
edit:
To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!
One of several possible sample outputs:
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909
We start by answering the underlying theoretical problem, which is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L → ℕ₀, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
Write down a sequence s of f(a) many a.
Add the remaining letters into a FIFO working list, that is:
(Don't add any a)
First add f(b) many b
Then f(c) many c
and so on
Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
...
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4.
s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
...
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.
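Before building the sequence, the existence condition from step 1 can be checked directly on input in the question's format; a minimal sketch (the file name "input" and the comma delimiter are assumptions):
cut -d, -f1 input | sort | uniq -c | sort -rn |
awk 'NR==1 {max=$1; next} {rest+=$1} END {exit !(max <= rest+1)}' &&
echo "a valid arrangement exists" || echo "impossible"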
Implementation In Bash
Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list, each line is labeled with a binary floating-point number specifying its position in the final sequence, and the lines are then sorted by their labels. That way we don't need explicit loops. Intermediate results are stored in variables; no files are created.
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
# group the input by system
inputBySys="$(sort "$inputFile")"
# task count per system, as "count,system" lines
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
# attach its count to every line and sort by count (descending), then by system: "system,count,task"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
# highest per-system count and total number of lines
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
# a power of two no larger than maxFreq/lineCount (usually a fraction such as 0.25)
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
# position labels: the multiples of increment below maxFreq written in binary by bc, as "integerPart,fraction,fractionLength";
# keep the lineCount "roundest" labels (shortest fractional part first)
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
# attach a label to every line, order by the labels, and print the SYSTEM,TASK columns
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.
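Whichever variant you use, the "no system twice in a row" property of a result file is easy to verify, because uniq -d only reports adjacent duplicates; a minimal sketch (the result file name is an assumption):
cut -d, -f1 result | uniq -d | grep -q . &&
echo "found the same system on adjacent lines" || echo "looks good"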
Well, this is what I've come up with:
args=()
while IFS=' ' read -r _ name; do
# add a process substitution that greps only this SYSTEM, for the later eval
args+=("<(grep '^$name,' file)")
done < <(
# extract SYSTEM only
<file cut -d, -f1 |
# count occurrences and sort, most frequent first
sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[@]}" |
# paste will insert empty lines when one of the lists has ended - remove them
sed '/^$/d'
First, I extract the SYSTEM names and sort them so that the most frequently occurring one comes first. So for the example input we get:
4 beta
3 alpha
1 gamma
Then for each such name I add the string <(grep '...' file) to the arguments list, which will later be evaluated.
Then I evaluate the call paste <(grep ...) <(grep ...) <(grep ...) ... with newline as paste's delimiter. I remove the empty lines with a simple sed call.
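So for the sample input, the eval effectively runs something equivalent to this (written out here only for illustration):
paste -d '\n' <(grep '^beta,' file) <(grep '^alpha,' file) <(grep '^gamma,' file) | sed '/^$/d'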
The output for the input provided:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
Converted to a fancy one-liner, substituting the while read loop with command substitution and sed. Made it safe with respect to the input file name by using printf "%q" "$inputfile" and double quoting inside the sed regex.
inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
cut -d "$fieldsep" -f1 "$inputfile" |
sort | uniq -c | sort -nr |
sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
tr '\n' ' '
)" |
sed '/^$/d'
inputfile="inputfile"
fieldsep=","
# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most frequent system and its count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha, surprise - counts is fed in here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurrences
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - its count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get the system names only, into `systems` - using the cached counts variable
# for each system name open a grep for that name on the input file
# with an assigned file descriptor
# The file descriptor list is saved in the array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position of that system inside the systems list, decremented by 1
# this is the index of the file descriptor that filters that system out of the input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder if `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
}
To accommodate strange input values, this script implements a somewhat simple but hardy state machine in bash.
The variable counts stores the SYSTEM names with their occurrence counts. So for the example input it will be
3 alpha
4 beta
1 gamma
Now we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement its occurrence count. If the count is equal to zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero and the list is empty. For the example input this will output:
beta
alpha
beta
alpha
beta
alpha
beta
gamma
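For comparison, the same greedy rule (always emit the remaining system with the highest count that differs from the previously emitted one) can be sketched in awk; this only produces the system-name sequence above, and the join back to the tasks still has to be done as described below (the file name "$inputfile" is the same assumption as in the script):
awk -F, '
{ cnt[$1]++; total++ }                 # count tasks per system
END {
  last = ""
  for (n = 0; n < total; n++) {
    best = ""; bestc = 0
    for (s in cnt)                     # most frequent remaining system != last
      if (s != last && cnt[s] > bestc) { best = s; bestc = cnt[s] }
    if (best == "") { print "impossible" > "/dev/stderr"; exit 1 }
    print best
    if (--cnt[best] == 0) delete cnt[best]
    last = best
  }
}' "$inputfile"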
Now we need to join that list with the task names. We can't use join, as the input is not sorted and we don't want to change the ordering. So what I do is get only the SYSTEM names into systems. Then for each system I open a separate file descriptor that filters only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then for each SYSTEM name coming from the first stage, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from it. This works like an array of file positions, each position associated with (and filtering for) one specific SYSTEM name.
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
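As an aside, the file-descriptor trick used above can be demonstrated in isolation (this assumes bash >= 4.1 for the automatic {fd} allocation; "$inputfile" is the same variable as in the script):
exec {fd_alpha}< <(grep '^alpha,' "$inputfile")
exec {fd_beta}< <(grep '^beta,' "$inputfile")
IFS= read -r -u "$fd_alpha" line && printf '%s\n' "$line"   # first alpha line
IFS= read -r -u "$fd_beta" line && printf '%s\n' "$line"    # first beta line
IFS= read -r -u "$fd_alpha" line && printf '%s\n' "$line"   # second alpha line
exec {fd_alpha}<&- {fd_beta}<&-                             # close both descriptors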
The script was also written so that, for input of the form:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script outputs correctly:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
I think this algorithm will almost always print correct output, but the ordering is such that the least common SYSTEMs are output last, which may not be optimal.
Tested manually with some custom tests and a checker on paiza.io.
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember SYSTEMs with their occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this has to hold:
# the count of the most frequent SYSTEM must not exceed the sum of all the others plus one
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most frequent system and its count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha, surprise - counts is fed in here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurrences
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - its count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get the system names only, into `systems` - using the cached counts variable
# for each system name open a grep for that name on the input file
# with an assigned file descriptor
# The file descriptor list is saved in the array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position of that system inside the systems list, decremented by 1
# this is the index of the file descriptor that filters that system out of the input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder if `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
} |
{
# check if the output is correct
output=$(cat)
# output should have same lines as inputfile
if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
echo "Output does not match input!" >&2
exit 1
fi
# two consecutive lines can't have the same system
lastsys=""
# note: the while loop runs in a subshell because of the pipe, so test the pipeline's exit status
if ! <<<"$output" cut -d, -f1 |
while IFS= read -r sys; do
if [ -n "$lastsys" ] && [ "$lastsys" = "$sys" ]; then
echo "Same systems found on two consecutive lines!" >&2
exit 1
fi
lastsys="$sys"
done
then
exit 1
fi
# all ok
echo "all ok!"
echo -------------
printf "%s\n" "$output"
}
exit
Related
how to loop through string for patterns from linux shell?
I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags. My current script does: grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \ sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \ sort -u printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n' The first line is generating an output like this: :politics:violence: :positivity: :positivity:somewhat: :psychology: :socialServices:family: :strategy: :tech: :therapy:babylon: :trauma: :triggered: :truama:leadership:business:toxicity: :unfurling: :tagOne:tagTwo:etcTag: And the objective is to get that into a list of single :tag:'s. Again, the problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic. **I want :tagOne:tagTwo:etcTag: to be turned this into: :tagOne: :tagTwo: :etcTag: and so forth with :politics:violence: etc. Colons aren't necessary, tagOne is just as good (maybe better, but this is trivial) than :tagOne:. The problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic. So I should replace the sed with something better... I've tried: A smarter sed: grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sort -u ...which works (for a limited number of tags) except that it produces weird results like: :toxicity:p: :somewhat:y: :people:n: ...placing weird random letters at the end of some tags in which :p: is the final character of the :leadership: tag and "leadership" no longer appears in the list. Same for :y: and :n:. I've also tried using loops in a couple ways... grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \ sort -u | grep lead ...which has the same problem of :leadership: tags being lost etc. And like... for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do printf "$t\n"; done done | sort -u ...which doesn't separate the tags at all, just prints stuff like: :truama:leadership:business:toxicity Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in python (I have a few python scripts but don't know the language well, but maybe this would be easy to do that way)? Every time I see awk I think "EEK!" so I'd prefer a non-awk solution please, preferring to stick to paradigms I've used in order to learn them better.
Using PCRE in grep (where available) and positive lookbehind: $ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:" tagOne: tagTwo: tagThree: You will lose the leading : but get the tags nevertheless. Edit: Did someone mention awk?: $ awk '{ while(match($0,/:[^:]+:/)) { a[substr($0,RSTART,RLENGTH)] $0=substr($0,RSTART+1) } } END { for(i in a) print i }' file
Another idea using awk ... Sample data generated by OPs initial grep: $ cat tags.raw :politics:violence: :positivity: :positivity:somewhat: :psychology: :socialServices:family: :strategy: :tech: :therapy:babylon: :trauma: :triggered: :truama:leadership:business:toxicity: :unfurling: :tagOne:tagTwo:etcTag: One awk idea: awk ' { split($0,tmp,":") # split input on colon; # NOTE: fields #1 and #NF are the empty string - see END block for ( x in tmp ) # loop through tmp[] indices { arr[tmp[x]] } # store tmp[] values as arr[] indices; this eliminates duplicates } END { delete arr[""] # remove the empty string from arr[] for ( i in arr ) # loop through arr[] indices { printf ":%s:\n", i } # print each tag on separate line leading/trailing colons } ' tags.raw | sort # sort final output NOTE: I'm not up to speed on awk's ability to internally sort arrays (thus eliminating the external sort call) so open to suggestions (or someone can copy this answer to a new one and update with said ability?) The above also generates: :babylon: :business: :etcTag: :family: :leadership: :politics: :positivity: :psychology: :socialServices: :somewhat: :strategy: :tagOne: :tagTwo: :tech: :therapy: :toxicity: :trauma: :triggered: :truama: :unfurling: :violence:
A pipe through tr can split those strings out to separate lines: grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n' This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting. An approach with sed: sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr: sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n' Using awk to work with the : separated fields, removing duplicates: awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \ ~/Documents/wiki{,/diary}/*.mkd
Sample data generated by OPs initial grep: $ cat tags.raw :politics:violence: :positivity: :positivity:somewhat: :psychology: :socialServices:family: :strategy: :tech: :therapy:babylon: :trauma: :triggered: :truama:leadership:business:toxicity: :unfurling: :tagOne:tagTwo:etcTag: One while/for/printf idea based on associative arrays: unset arr typeset -A arr # declare array named 'arr' as associative while read -r line # for each line from tags.raw ... do for word in ${line//:/ } # replace ":" with space and process each 'word' separately do arr[${word}]=1 # create/overwrite arr[$word] with value 1; # objective is to make sure we have a single entry in arr[] for $word; # this eliminates duplicates done done < tags.raw printf ":%s:\n" "${!arr[#]}" | sort # pass array indices (ie, our unique list of words) to printf; # per OPs desired output we'll bracket each word with a pair of ':'; # then sort Per OPs comment/question about removing the array, a twist on the above where we eliminate the array in favor of printing from the internal loop and then piping everything to sort -u: while read -r line # for each line from tags.raw ... do for word in ${line//:/ } # replace ":" with space and process each 'word' separately do printf ":%s:\n" "${word}" # print ${word} to stdout done done < tags.raw | sort -u # pipe all output (ie, list of ${word}s for sorting and removing dups Both of the above generates: :babylon: :business: :etcTag: :family: :leadership: :politics: :positivity: :psychology: :socialServices: :somewhat: :strategy: :tagOne: :tagTwo: :tech: :therapy: :toxicity: :trauma: :triggered: :truama: :unfurling: :violence:
Extract orders and match to trades from two files
I have two attached files (orders1.txt and trades1.txt) I need to write a Bash script (possibly awk?) to extract orders and match them to trades. The output should produce a report that prints comma separated values containing “ClientID, OrderID, Price, Volume”. In addition to this for each client, I need to print the total volume and turnover (turnover is the subtotal of price * volume on each trade). Can someone please help me with a bash script that will do the above using the attached files? Any help would be greatly appreciated orders1.txt Entry Time, Client ID, Security ID, Order ID 25455410,DOLR,XGXUa,DOLR1435804437 25455410,XFKD,BUP3d,XFKD4746464646 25455413,QOXA,AIDl,QOXA7176202067 25455415,QOXA,IRUXb,QOXA6580494597 25455417,YXKH,OBWQs,YXKH4575139017 25455420,JBDX,BKNs,JBDX6760353333 25455428,DOLR,AOAb,DOLR9093170513 25455429,JBDX,QMP1Sh,JBDX2756804453 25455431,QOXA,QIP1Sh,QOXA6563975285 25455434,QOXA,XMUp,QOXA5569701531 25455437,XFKD,QLOJc,XFKD8793976660 25455438,YXKH,MRPp,YXKH2329856527 25455442,JBDX,YBPu,JBDX0100506066 25455450,QOXA,BUPYd,QOXA5832015401 25455451,QOXA,SIOQz,QOXA3909507967 25455451,DOLR,KID1Sh,DOLR2262067037 25455454,DOLR,JJHi,DOLR9923665017 25455461,YXKH,KBAPBa,YXKH2637373848 25455466,DOLR,EPYp,DOLR8639062962 25455468,DOLR,UQXKz,DOLR4349482234 25455474,JBDX,EFNs,JBDX7268036859 25455481,QOXA,XCB1Sh,QOXA4105943392 25455486,YXKH,XBAFp,YXKH0242733672 25455493,JBDX,BIF1Sh,JBDX2840241688 25455500,DOLR,QSOYp,DOLR6265839896 25455503,YXKH,IIYz,YXKH8505951163 25455504,YXKH,ZOIXp,YXKH2185348861 25455513,YXKH,MBOOp,YXKH4095442568 25455515,JBDX,P35p,JBDX9945514579 25455524,QOXA,YXOKz,QOXA1900595629 25455528,JBDX,XEQl,JBDX0126452783 25455528,XFKD,FJJMp,XFKD4392227425 25455535,QOXA,EZIp,QOXA4277118682 25455543,QOXA,YBPFa,QOXA6510879584 25455551,JBDX,EAMp,JBDX8924251479 25455552,QOXA,JXIQp,QOXA4360008399 25455554,DOLR,LISXPh,DOLR1853653280 25455557,XFKD,LOX14p,XFKD1759342196 25455558,JBDX,YXYb,JBDX8177118129 25455567,YXKH,MZQKl,YXKH6485420018 25455569,JBDX,ZPIMz,JBDX2010952336 25455573,JBDX,COPe,JBDX1612537068 25455582,JBDX,HFKAp,JBDX2409813753 25455589,QOXA,XFKm,QOXA9692126523 25455593,XFKD,OFYp,XFKD8556940415 25455601,XFKD,FKQLb,XFKD4861992028 25455606,JBDX,RIASp,JBDX0262502677 25455608,DOLR,HRKKz,DOLR1739013513 25455615,DOLR,ZZXp,DOLR6727725911 25455623,JBDX,CKQPp,JBDX2587184235 25455630,YXKH,ZLQQp,YXKH6492126889 25455632,QOXA,ORPz,QOXA3594333316 25455640,XFKD,HPIXSh,XFKD6780729432 25455648,QOXA,ABOJe,QOXA6661411952 25455654,XFKD,YLIp,XFKD6374702721 25455654,DOLR,BCFp,DOLR8012564477 25455658,JBDX,ZMDKz,JBDX6885176695 25455665,JBDX,CBOe,JBDX8942732453 25455670,JBDX,FRHMl,JBDX5424320405 25455679,DOLR,YFJm,DOLR8212353717 25455680,XFKD,XAFp,XFKD4132890550 25455681,YXKH,PBIBOp,YXKH6106504736 25455684,DOLR,IFDu,DOLR8034515043 25455687,JBDX,JACe,JBDX8243949318 25455688,JBDX,ZFZKz,JBDX0866225752 25455693,QOXA,XOBm,QOXA5011416607 25455694,QOXA,IDQe,QOXA7608439570 25455698,JBDX,YBIDb,JBDX8727773702 25455705,YXKH,MXOp,YXKH7747780955 25455710,YXKH,PBZRYs,YXKH7353828884 25455719,QOXA,QFDb,QOXA2477859437 25455720,XFKD,PZARp,XFKD4995735686 25455722,JBDX,ZLKKb,JBDX3564523161 25455730,XFKD,QFH1Sh,XFKD6181225566 25455733,JBDX,KWVJYc,JBDX7013108210 25455733,YXKH,ZQI1Sh,YXKH7095815077 25455739,YXKH,XIJp,YXKH0497248757 25455739,YXKH,ZXJp,YXKH5848658513 25455747,JBDX,XASd,JBDX4986246117 25455751,XFKD,XQIKz,XFKD5919379575 25455760,JBDX,IBXPb,JBDX8168710376 25455763,XFKD,EVAOi,XFKD8175209012 25455765,XFKD,JXKp,XFKD2750952933 25455773,XFKD,PTBAXs,XFKD8139382011 25455778,QOXA,XJp,QOXA8227838196 
25455783,QOXA,CYBIp,QOXA2072297264 25455792,JBDX,PZI1Sh,JBDX7022115629 25455792,XFKD,XIKQl,XFKD6434550362 25455792,DOLR,YKPm,DOLR6394606248 25455796,QOXA,JXOXPh,QOXA9672544909 25455797,YXKH,YIWm,YXKH5946342983 25455803,YXKH,JZEm,YXKH5317189370 25455810,QOXA,OBMFz,QOXA0985316706 25455810,QOXA,DAJPp,QOXA6105975858 25455810,JBDX,FBBJl,JBDX1316207043 25455819,XFKD,YXKm,XFKD6946276671 25455821,YXKH,UIAUs,YXKH6010226371 25455828,DOLR,PTJXs,DOLR1387517499 25455836,DOLR,DCEi,DOLR3854078054 25455845,YXKH,NYQe,YXKH3727923537 25455853,XFKD,TAEc,XFKD5377097556 25455858,XFKD,LMBOXo,XFKD4452678489 25455858,XFKD,AIQXp,XFKD5727938304 trades1.txt # The first 8 characters is execution time in microseconds since midnight # The next 14 characters is the order ID # The next 8 characters is the zero padded price # The next 8 characters is the zero padded volume 25455416QOXA6580494597 0000013800001856 25455428JBDX6760353333 0000007000002458 25455434DOLR9093170513 0000000400003832 25455435QOXA6563975285 0000034700009428 25455449QOXA5569701531 0000007500009023 25455447YXKH2329856527 0000038300009947 25455451QOXA5832015401 0000039900006432 25455454QOXA3909507967 0000026900001847 25455456DOLR2262067037 0000034700002732 25455471YXKH2637373848 0000010900006105 25455480DOLR8639062962 0000027500001975 25455488JBDX7268036859 0000005200004986 25455505JBDX2840241688 0000037900002029 25455521YXKH4095442568 0000046400002150 25455515JBDX9945514579 0000040800005904 25455535QOXA1900595629 0000015200006866 25455533JBDX0126452783 0000001700006615 25455542XFKD4392227425 0000035500009948 25455570XFKD1759342196 0000025700007816 25455574JBDX8177118129 0000022400000427 25455567YXKH6485420018 0000039000008327 25455573JBDX1612537068 0000013700001422 25455584JBDX2409813753 0000016600003588 25455603XFKD4861992028 0000017600004552 25455611JBDX0262502677 0000007900003235 25455625JBDX2587184235 0000024300006723 25455658XFKD6374702721 0000046400009451 25455673JBDX6885176695 0000010900009258 25455671JBDX5424320405 0000005400003618 25455679DOLR8212353717 0000041100003633 25455697QOXA5011416607 0000018800007376 25455696QOXA7608439570 0000013000007463 25455716YXKH7747780955 0000037000006357 25455719QOXA2477859437 0000039300009840 25455723XFKD4995735686 0000045500009858 25455727JBDX3564523161 0000021300000639 25455742YXKH7095815077 0000023000003945 25455739YXKH5848658513 0000042700002084 25455766XFKD5919379575 0000022200003603 25455777XFKD8175209012 0000033300006350 25455788XFKD8139382011 0000034500007461 25455793QOXA8227838196 0000011600007081 25455784QOXA2072297264 0000017000004429 25455800XFKD6434550362 0000030000002409 25455801QOXA9672544909 0000039600001033 25455815QOXA6105975858 0000034800008373 25455814JBDX1316207043 0000026500005237 25455831YXKH6010226371 0000011400004945 25455838DOLR1387517499 0000046200006129 25455847YXKH3727923537 0000037400008061 25455873XFKD5727938304 0000048700007298 I have the following script: ''' #!/bin/bash declare -A volumes declare -A turnovers declare -A orders # Read the first file, remembering for each order the client id while read -r line do # Jump over comments if [[ ${line:0:1} == "#" ]] ; then continue; fi; details=($(echo $line | tr ',' " ")) order_id=${details[3]} client_id=${details[1]} orders[$order_id]=$client_id done < $1 echo "ClientID,OrderID,Price,Volume" while read -r line do # Jump over comments if [[ ${line:0:1} == "#" ]] ; then continue; fi; order_id=$(echo ${line:8:20} | tr -d '[:space:]') client_id=${orders[$order_id]} price=${line:28:8} volume=${line: -8} echo 
"$client_id,$order_id,$price,$volume" price=$(echo $price | awk '{printf "%d", $0}') volume=$(echo $volume | awk '{printf "%d", $0}') order_turnover=$(($price*$volume)) old_turnover=${turnovers[$client_id]} [[ -z "$old_turnover" ]] && old_turnover=0 total_turnover=$(($old_turnover+$order_turnover)) turnovers[$client_id]=$total_turnover old_volumes=${volumes[$client_id]} [[ -z "$old_volumes" ]] && old_volumes=0 total_volume=$((old_volumes+volume)) volumes[$client_id]=$total_volume done < $2 echo "ClientID,Volume,Turnover" for client_id in ${!volumes[#]} do volume=${volumes[$client_id]} turnover=${turnovers[$client_id]} echo "$client_id,$volume,$turnover" done Can anyone think of anything more elegant? Thanks in advance C
Assumption 1: the two files are ordered, so line x represents an action that is older than x+1. If not, then further work is needed. The assumption makes our work easier. Let's first change the delimiter of traders into a comma: sed -i 's/ /,/g' traders.txt This will be done in place for sake of simplicity. So, you now have traders which is comma separated, as is orders. This is the Assumption 2. Keep working on traders: split all columns and add titles1. More on the reasons why in a moment. gawk -i inplace -v INPLACE_SUFFIX=.bak 'BEGINFILE{FS=",";OFS=",";print "execution time,order ID,price,volume";}{print substr($1,1,8),substr($1,9),substr($2,1,9),substr($2,9)}' traders.txt Ugly but works. Now let's process your data using the following awk script: BEGIN { FS="," OFS="," } { if (1 == NR) { getline line < TRADERS # consume title line print "Client ID,Order ID,Price,Volume,Turnover"; # consume title line. Remove print to forget it getline line < TRADERS # reads first data line split(line, transaction, ",") next } if (transaction[2] == $4) { print $2, $4, transaction[3], transaction[4], transaction[3]*transaction[4] getline line < TRADERS # reads new data line split(line, transaction, ",") } } called by: gawk -f script -v TRADERS=traders.txt orders.txt And there you have it. Some caveats: check the numbers, as implicit gawk number conversion might not be correct with zero-padded numbers. There is a fix for that in case; getline might explode if we run out of lines from traders. I haven't put any check, that's up to you no control over timestamps. Match is based on Order ID. Output file: Client ID,Order ID,Price,Volume,Turnover QOXA,QOXA6580494597,000001380,00001856,2561280 JBDX,JBDX6760353333,000000700,00002458,1720600 DOLR,DOLR9093170513,000000040,00003832,153280 QOXA,QOXA6563975285,000003470,00009428,32715160 QOXA,QOXA5569701531,000000750,00009023,6767250 YXKH,YXKH2329856527,000003830,00009947,38097010 QOXA,QOXA5832015401,000003990,00006432,25663680 QOXA,QOXA3909507967,000002690,00001847,4968430 DOLR,DOLR2262067037,000003470,00002732,9480040 YXKH,YXKH2637373848,000001090,00006105,6654450 DOLR,DOLR8639062962,000002750,00001975,5431250 JBDX,JBDX7268036859,000000520,00004986,2592720 JBDX,JBDX2840241688,000003790,00002029,7689910 YXKH,YXKH4095442568,000004640,00002150,9976000 JBDX,JBDX9945514579,000004080,00005904,24088320 QOXA,QOXA1900595629,000001520,00006866,10436320 JBDX,JBDX0126452783,000000170,00006615,1124550 XFKD,XFKD4392227425,000003550,00009948,35315400 XFKD,XFKD1759342196,000002570,00007816,20087120 JBDX,JBDX8177118129,000002240,00000427,956480 YXKH,YXKH6485420018,000003900,00008327,32475300 JBDX,JBDX1612537068,000001370,00001422,1948140 JBDX,JBDX2409813753,000001660,00003588,5956080 XFKD,XFKD4861992028,000001760,00004552,8011520 JBDX,JBDX0262502677,000000790,00003235,2555650 JBDX,JBDX2587184235,000002430,00006723,16336890 XFKD,XFKD6374702721,000004640,00009451,43852640 JBDX,JBDX6885176695,000001090,00009258,10091220 JBDX,JBDX5424320405,000000540,00003618,1953720 DOLR,DOLR8212353717,000004110,00003633,14931630 QOXA,QOXA5011416607,000001880,00007376,13866880 QOXA,QOXA7608439570,000001300,00007463,9701900 YXKH,YXKH7747780955,000003700,00006357,23520900 QOXA,QOXA2477859437,000003930,00009840,38671200 XFKD,XFKD4995735686,000004550,00009858,44853900 JBDX,JBDX3564523161,000002130,00000639,1361070 YXKH,YXKH7095815077,000002300,00003945,9073500 YXKH,YXKH5848658513,000004270,00002084,8898680 XFKD,XFKD5919379575,000002220,00003603,7998660 XFKD,XFKD8175209012,000003330,00006350,21145500 
XFKD,XFKD8139382011,000003450,00007461,25740450 QOXA,QOXA8227838196,000001160,00007081,8213960 QOXA,QOXA2072297264,000001700,00004429,7529300 XFKD,XFKD6434550362,000003000,00002409,7227000 QOXA,QOXA9672544909,000003960,00001033,4090680 QOXA,QOXA6105975858,000003480,00008373,29138040 JBDX,JBDX1316207043,000002650,00005237,13878050 YXKH,YXKH6010226371,000001140,00004945,5637300 DOLR,DOLR1387517499,000004620,00006129,28315980 YXKH,YXKH3727923537,000003740,00008061,30148140 XFKD,XFKD5727938304,000004870,00007298,35541260 1: requires gawk 4.1.0 or higher
How can I fix my bash script to find a random word from a dictionary?
I'm studying bash scripting and I'm stuck fixing an exercise of this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities The task is to write a bash script to output a random word from a dictionary whose length is equal to the number supplied as the first command line argument. My idea was to create a sub-dictionary, assign each word a number line, select a random number from those lines and filter the output, which worked for a similar simpler script, but not for this. This is the code I used: 6 DIC='/usr/share/dict/words' 7 SUBDIC=$( egrep '^.{'$1'}$' $DIC ) 8 9 MAX=$( $SUBDIC | wc -l ) 10 RANDRANGE=$((1 + RANDOM % $MAX)) 11 12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}') 13 14 echo "Random generated word from $DIC which is $1 characters long:" 15 echo $RWORD and this is the error I get using as input "21": bash script.sh 21 script.sh: line 9: counterintelligence's: command not found script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0") nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory Random generated word from /usr/share/dict/words which is 21 characters long: I tried in bash to split the code in smaller pieces obtaining no error (input=21): egrep '^.{'21'}$' /usr/share/dict/words | wc -l 3 but once in the script line 9 and 10 give error. Where do you think is the error?
problems SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so it's content is now something like foo bar baz. MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... ) MAX=$( ... | wc -l ) will count the lines; when using the above mentioned echo $SUBDIC you will have multiple words, but all in one line... RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note #armali's answer that nl requires a file or stdin) RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might match the dictionary entry catch 22 likely RWORD=$(... | awk '{print $2}') won't handle lines containing spaces a simple solution doing a "random sort" over the all the possible words and taking the first line, should be sufficient: egrep "^.{$1}$" "${DIC}" | sort -R | head -1
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
This code will do it. My test dictionary of words is in the file file. It's a good idea to get all words of a given length first, but put them in an array, not in a plain variable. Then get a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[@]}))
echo ${dic[$ind]}
I am also doing this activity and I created one simple solution. I created this script:
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
If you want a random word, run the script from the terminal as:
./script.sh $RANDOM
If you want to print the word at a specific line number, run it as:
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM returns a different random number each time it is referred to, so this simple line outputs a random word from the mentioned dictionary. Otherwise, as umläute mentioned, you can do:
cat /usr/share/dict/american-english | sort -R | head -1
Print a row of 16 lines evenly side by side (column)
I have a file with unknown number of lines(but even number of lines). I want to print them side by side based on total number of lines in that file. For example, I have a file with 16 lines like below: asdljsdbfajhsdbflakjsdff235 asjhbasdjbfajskdfasdbajsdx3 asjhbasdjbfajs23kdfb235ajds asjhbasdjbfajskdfbaj456fd3v asjhbasdjb6589fajskdfbaj235 asjhbasdjbfajs54kdfbaj2f879 asjhbasdjbfajskdfbajxdfgsdh asjhbasdf3709ddjbfajskdfbaj 100 100 150 125 trh77rnv9vnd9dfnmdcnksosdmn 220 225 sdkjNSDfasd89asdg12asdf6asdf So now i want to print them side by side. as they have 16 lines in total, I am trying to get the results 8:8 like below asdljsdbfajhsdbflakjsdff235 100 asjhbasdjbfajskdfasdbajsdx3 100 asjhbasdjbfajs23kdfb235ajds 150 asjhbasdjbfajskdfbaj456fd3v 125 asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn asjhbasdjbfajs54kdfbaj2f879 220 asjhbasdjbfajskdfbajxdfgsdh 225 asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf paste command did not work for me exactly, (paste - - - - - - - -< file1) nor the awk command that I used awk '{printf "%s" (NR%2==0?RS:FS),$1}' Note: The number of lines in a file dynamic. The only known thing in my scenario is, they are even number all the time.
If you have the memory to hash the whole file ("max" below): $ awk '{ a[NR]=$0 # hash all the records } END { # after hashing mid=int(NR/2) # compute the midpoint, int in case NR is uneven for(i=1;i<=mid;i++) # iterate from start to midpoint print a[i],a[mid+i] # output }' file If you have the memory to hash half of the file ("mid"): $ awk ' NR==FNR { # on 1st pass hash second half of records if(FNR>1) { # we dont need the 1st record ever a[FNR]=$0 # hash record if(FNR%2) # if odd record delete a[int(FNR/2)+1] # remove one from the past } next } FNR==1 { # on the start of 2nd pass if(NR%2==0) # if record count is uneven exit # exit as there is always even count of them offset=int((NR-1)/2) # compute offset to the beginning of hash } FNR<=offset { # only process the 1st half of records print $0,a[offset+FNR] # output one from file, one from hash next } { # once 1st half of 2nd pass is finished exit # just exit }' file file # notice filename twice And finally if you have awk compiled into a worms brain (ie. not so much memory, "min"): $ awk ' NR==FNR { # just get the NR of 1st pass next } FNR==1 { mid=(NR-1)/2 # get the midpoint file=FILENAME # filename for getline while(++i<=mid && (getline line < file)>0); # jump getline to mid } { if((getline line < file)>0) # getline read from mid+FNR print $0,line # output }' file file # notice filename twice Standard disclaimer on getline and no real error control implemented. Performance: I seq 1 100000000 > file and tested how the above solutions performed. Output was > /dev/null but writing it to a file lasted around 2 s longer. max performance is so-so as the mem print was 88 % of my 16 GB so it might have swapped. Well, I killed all the browsers and shaved off 7 seconds for the real time of max. +------------------+-----------+-----------+ | which | | | | min | mid | max | +------------------+-----------+-----------+ | time | | | | real 1m7.027s | 1m30.146s | 0m48.405s | | user 1m6.387s | 1m27.314 | 0m43.801s | | sys 0m0.641s | 0m2.820s | 0m4.505s | +------------------+-----------+-----------+ | mem | | | | 3 MB | 6.8 GB | 13.5 GB | +------------------+-----------+-----------+ Update: I tested #DavidC.Rankin's and #EdMorton's solutions and they ran, respectively: real 0m41.455s user 0m39.086s sys 0m2.369s and real 0m39.577s user 0m37.037s sys 0m2.541s Mem print was about the same as my mid had. It pays to use the wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235         100
asjhbasdjbfajskdfasdbajsdx3         100
asjhbasdjbfajs23kdfb235ajds         150
asjhbasdjbfajskdfbaj456fd3v         125
asjhbasdjb6589fajskdfbaj235         trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879         220
asjhbasdjbfajskdfbajxdfgsdh         225
asjhbasdf3709ddjbfajskdfbaj         sdkjNSDfasd89asdg12asdf6asdf
if you want just one space between columns, change to
$ pr -2ts' ' file
You can also do it with awk simply by storing the first-half of the lines in an array and then concatenating the second half to the end, e.g. awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file Example Use/Output With your data in file, then $ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file asdljsdbfajhsdbflakjsdff235 100 asjhbasdjbfajskdfasdbajsdx3 100 asjhbasdjbfajs23kdfb235ajds 150 asjhbasdjbfajskdfbaj456fd3v 125 asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn asjhbasdjbfajs54kdfbaj2f879 220 asjhbasdjbfajskdfbajxdfgsdh 225 asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use the columns utility from autogen:
columns -c2 --by-columns file.txt
You can use column, but the count of columns is calculated in a strange way from the width of your terminal. So assuming your lines have 28 characters, you also can:
column -c $((28*2+8)) file.txt
I do not want to solve this, but if I were you: wc -l file.txt gives number of lines echo $(($(wc -l < file.txt)/2)) gives a half head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt create file with first half and last half of the original file. Now you can merge those files together side by side as it was described here .
Here is my take on it using the bash shell wc(1) and ed(1) #!/usr/bin/env bash array=() file=$1 total=$(wc -l < "$file") half=$(( total / 2 )) plus1=$(( half + 1 )) for ((m=1;m<=half;m++)); do array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j') done After all of that if just want to print the output to stdout. Add the line below to the script. printf '%s\n' "${array[#]}" ,p Q | ed -s "$file" If you want to write the changes directly to the file itself, Use this code instead below the script. printf '%s\n' "${array[#]}" w | ed -s "$file" Here is an example. printf '%s\n' {1..10} > file.txt Now running the script against that file. ./myscript file.txt Output 1 6 2 7 3 8 4 9 5 10 Or using bash4+ feature mapfile aka readarray Save the file in an array named array. mapfile -t array < file.txt Separate the files. left=("${array[#]::((${#array[#]} / 2))}") right=("${array[#]:((${#array[#]} / 2 ))}") loop and print side-by-side for i in "${!left[#]}"; do printf '%s %s\n' "${left[i]}" "${right[i]}" done What you said The only known thing in my scenario is, they are even number all the time. That solution should work.
sort fasta by sequence size
I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic): >id1 sequence 1 # could be on several line >id2 sequence 2 ... I have run a tools that give me in tsv format: the Identifiant, the length, and the position in bytes of the identifiant. for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file. # this fonction will get the sequence using seek def get_seq(file, bites): with open(file) as f_: f_.seek(bites, 0) # go to the line of interest line = f_.readline().strip() # this line is the begin of the #sequence to_return = "" # init the string which will contains the sequence while not line.startswith('>') or not line: # while we do not # encounter another identifiant to_return += line line = f_.readline().strip() return to_return # simply append to a file the id and the sequence def write_seq(out_file, id_, sequence): with open(out_file, 'a') as out_file: out_file.write('>{}\n{}\n'.format(id_.strip(), sequence)) # main loop will parse the index file and call the function defined below with open(args.fai) as ref: indice = 0 for line in ref: spt = line.split() id_ = spt[0] seq = get_seq(args.i, int(spt[2])) write_seq(out_file=args.out, id_=id_, sequence=seq) my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?
Seems like opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)
What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster. csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}' for file in tmp_fasta_file_* do TMP_FASTA_WC=$(wc -l < $file | tr -d ' ') FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n") done for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}') do cat "$filename" >> $2 done rm tmp_fasta_file* First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta
Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta: cat test.fasta \ | tee >(wc -l > nb_lines_fasta.txt) \ | bioawk -c fastx '{l = length($seq); printf "#"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \ | tee >(wc -l > nb_lines_fastq.txt) \ | fastq-sort -L \ | tee >(wc -l > nb_lines_fastq_sorted.txt) \ | bioawk -c fastx '{print ">"$name"\n"$seq}' \ | tee >(wc -l > nb_lines_fasta_sorted.txt) \ > test_sorted.fasta The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers. The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads. In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short. The resulting fasta file will have the shorter sequences first. To get the longest sequences at the top of the file, add the -r option to fastq-sort. Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot. Edit I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself: cat test.fasta \ | bioawk -c fastx '{print "#"$name"\n"$seq"\n+\n"$seq}' \ | fastq-sort -L \ | bioawk -c fastx '{print ">"$name"\n"$seq}' \ > test_sorted.fasta This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.
You can also do it very conveniently with awk, check the code below: awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\ awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\ sort -k1,1n | cut -f 2- | tr "\t" "\n" This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such like this one.