Bash: Loop Read N lines at time from CSV

Bash: Loop Read N lines at time from CSV - linux

I have a csv file of 100000 ids
wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr
that are being transformed into a json object using jq
output=$(jq -Rsn '
{"id":
[inputs
| . / "\n"
| (.[] | select(length > 0) | . / ";") as $input
| $input[0]]}
' <$FILE)
output
{
"id": [
"wef7efwe1fwe8",
"wef7efwe1fwe3",
....
]
}
currently, I need to manually split the file into smaller 10000 line files... because the API call has a limit.
I would like a way to automatically loop through the large file... and only use 10000 lines as a time as $FILE... up until the end of the list.

I would use the split command and write a little shell script around it:
#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000
# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"
# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
jq -Rsn '
{"id":
[inputs
| . / "\n"
| (.[] | select(length > 0) | . / ";") as $input
| $input[0]]
}' "${split}" > api_payload.json
# now do something ...
# curl -dapi_payload.json http://...
rm -f api_payload.json
done
# Clean up
rm -rf "${temp_dir}"

Here's a simple and efficient solution that at its core just uses jq. It takes advantage of the -c command-line option. I've used xargs printf ... for illustration - mainly to show how easy it is to set up a shell pipeline.
< data.txt jq -Rnc '
def batch($n; stream):
def b: [limit($n; stream)]
| select(length > 0)
| (., b);
b;
{id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
Parameterizing batch size
It might make sense to set things up so that the batch size is specified outside the jq program. This could be done in numerous ways, e.g. by invoking jq along the lines of:
jq --argjson n 10000 ....
and of course using $n instead of 10000 in the jq program.
Why “def b:”?
For efficiency. jq’s TCO (tail recursion optimization) only works for arity-0 filters.
Note on -s
In the Q as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process input in a stream-oriented way (i.e. one line of input or one JSON entity at a time).

Related

how to loop through string for patterns from linux shell?

I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.
My current script does:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'
The first line is generating an output like this:
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
And the objective is to get that into a list of single :tag:'s.
Again, the problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
**I want :tagOne:tagTwo:etcTag: to be turned this into:
:tagOne:
:tagTwo:
:etcTag:
and so forth with :politics:violence: etc.
Colons aren't necessary, tagOne is just as good (maybe better, but this is trivial) than :tagOne:.
The problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
So I should replace the sed with something better...
I've tried:
A smarter sed:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u
...which works (for a limited number of tags) except that it produces weird results like:
:toxicity:p:
:somewhat:y:
:people:n:
...placing weird random letters at the end of some tags in which :p: is the final character of the :leadership: tag and "leadership" no longer appears in the list. Same for :y: and :n:.
I've also tried using loops in a couple ways...
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u | grep lead
...which has the same problem of :leadership: tags being lost etc.
And like...
for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
printf "$t\n";
done
done | sort -u
...which doesn't separate the tags at all, just prints stuff like:
:truama:leadership:business:toxicity
Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in python (I have a few python scripts but don't know the language well, but maybe this would be easy to do that way)? Every time I see awk I think "EEK!" so I'd prefer a non-awk solution please, preferring to stick to paradigms I've used in order to learn them better.

Using PCRE in grep (where available) and positive lookbehind:
$ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:
You will lose the leading : but get the tags nevertheless.
Edit: Did someone mention awk?:
$ awk '{
while(match($0,/:[^:]+:/)) {
a[substr($0,RSTART,RLENGTH)]
$0=substr($0,RSTART+1)
}
}
END {
for(i in a)
print i
}' file

Another idea using awk ...
Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One awk idea:
awk '
{ split($0,tmp,":") # split input on colon;
# NOTE: fields #1 and #NF are the empty string - see END block
for ( x in tmp ) # loop through tmp[] indices
{ arr[tmp[x]] } # store tmp[] values as arr[] indices; this eliminates duplicates
}
END { delete arr[""] # remove the empty string from arr[]
for ( i in arr ) # loop through arr[] indices
{ printf ":%s:\n", i } # print each tag on separate line leading/trailing colons
}
' tags.raw | sort # sort final output
NOTE: I'm not up to speed on awk's ability to internally sort arrays (thus eliminating the external sort call) so open to suggestions (or someone can copy this answer to a new one and update with said ability?)
The above also generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:

A pipe through tr can split those strings out to separate lines:
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
An approach with sed:
sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd
This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr:
sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
Using awk to work with the : separated fields, removing duplicates:
awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd

Sample data generated by OPs initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One while/for/printf idea based on associative arrays:
unset arr
typeset -A arr # declare array named 'arr' as associative
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
arr[${word}]=1 # create/overwrite arr[$word] with value 1;
# objective is to make sure we have a single entry in arr[] for $word;
# this eliminates duplicates
done
done < tags.raw
printf ":%s:\n" "${!arr[#]}" | sort # pass array indices (ie, our unique list of words) to printf;
# per OPs desired output we'll bracket each word with a pair of ':';
# then sort
Per OPs comment/question about removing the array, a twist on the above where we eliminate the array in favor of printing from the internal loop and then piping everything to sort -u:
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
printf ":%s:\n" "${word}" # print ${word} to stdout
done
done < tags.raw | sort -u # pipe all output (ie, list of ${word}s for sorting and removing dups
Both of the above generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:

Interleave lines sorted by a column

(Similar to How to interleave lines from two text files but for a single input. Also similar to Sort lines by group and column but interleaving or randomizing versus sorting.)
I have a set of systems and tasks in two columns, SYSTEM,TASK:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
beta,33452909
beta,40850198
beta,82645731
gamma,64910850
I want to distribute the tasks to each system in a balanced way. The ideal case where each system has the same number of tasks would be round-robin, one alpha then one beta then one gamma and repeat until finished.
I get the whole list of tasks + systems at once, so I don't need to keep any state
The list of systems is not static, on the order of N=100
The total number of tasks is variable, on the order of N=500
The number of tasks for each system is not guaranteed to be equal
Hard / absolute interleaving isn't required, as long as there aren't two of the same system twice in a row
The same task may show up more than once, but not for the same system
Input format / delimiter can be changed
I can solve this well enough with some fancy scripting to split the data into multiple files (grep ^alpha, input > alpha.txt etc) and then recombine them with paste or similar, but I'd like to use a single command or set of pipes to run it without intermediate files or a proper scripting language. Just using sort -R gets me 95% of the way there, but I end up with 2 tasks for the same system in a row almost every time, and sometimes 3 or more depending on the initial distribution.
edit:
To clarify, any output should not have the same system on two lines in a row. All system,task pairs must be preserved, you can't move a task from one system to another - that'd make this really easy!
One of several possible sample outputs:
beta,40850198
alpha,90198500
beta,82645731
alpha,93082105
gamma,64910850
beta,21700055
alpha,30184438
beta,33452909

We start with by answering the underlying theoretical problem. The problem is not as simple as it seems. Feel free to implement a script based on this answer.
The blocks formatted as quotes are not quotes. I just wanted to highlight them to improve navigation in this rather long answer.
Theoretical Problem
Given a finite set of letters L with frequencies f : L→ℕ0, find a sequence of letters such that every letter ℓ appears exactly f(ℓ) times and adjacent elements of the sequence are always different.
Example
L = {a,b,c} with f(a)=4, f(b)=2, f(c)=1
ababaca, acababa, and abacaba are all valid solutions.
aaaabbc is invalid – Some adjacent elements are equal, for instance aa or bb.
ababac is invalid – The letter a appears 3 times, but its frequency is f(a)=4
cababac is invalid – The letter c appears 2 times, but its frequency is f(c)=1
Solution
The following approach produces a valid sequence if and only if there exists a solution.
Sort the letters by their frequencies.
For ease of notation we assume, without loss of generality, that f(a) ≥ f(b) ≥ f(c) ≥ ... ≥ 0.
Note: There exists a solution if and only if f(a) ≤ 1 + ∑ℓ≠a f(ℓ).
Write down a sequence s of f(a) many a.
Add the remaining letters into a FIFO working list, that is:
(Don't add any a)
First add f(b) many b
Then f(c) many c
and so on
Iterate from left to right over the sequence s and insert after each element a letter from the working list. Repeat this step until the working list is empty.
Example
L = {a,b,c,d} with f(a)=5, f(b)=5, f(c)=4, f(d)=2
The letters are already sorted by their frequencies.
s = aaaaa
workinglist = bbbbbccccdd. The leftmost entry is the first one.
We iterate from left to right. The places where we insert letters from the working list are marked with an _ underscore.
s = a_a_a_a_a_ workinglist = bbbbbccccdd
s = aba_a_a_a_ workinglist = bbbbccccdd
s = ababa_a_a_ workinglist = bbbccccdd
...
s = ababababab workinglist = ccccdd
⚠️ We reached the end of sequence s. We repeat step 4.
s = a_b_a_b_a_b_a_b_a_b_ workinglist = ccccdd
s = acb_a_b_a_b_a_b_a_b_ workinglist = cccdd
...
s = acbcacb_a_b_a_b_a_b_ workinglist = cdd
s = acbcacbca_b_a_b_a_b_ workinglist = dd
s = acbcacbcadb_a_b_a_b_ workinglist = d
s = acbcacbcadbda_b_a_b_ workinglist =
⚠️ The working list is empty. We stop.
The final sequence is acbcacbcadbdabab.
Implementation In Bash
Here is a bash implementation of the proposed approach that works with your input format. Instead of using a working list each line is labeled with a binary floating point number specifying the position of that line in the final sequence. Then the lines are sorted by their labels. That way we don't have to use explicit loops. Intermediate results are stored in variables. No files are created.
#! /bin/bash
inputFile="$1" # replace $1 by your input file or call "./thisScript yourFile"
inputBySys="$(sort "$inputFile")"
sysFreqBySys="$(cut -d, -f1 <<< "$inputBySys" | uniq -c | sed 's/^ *//;s/ /,/')"
inputBySysFreq="$(join -t, -1 2 -2 1 <(echo "$sysFreqBySys") <(echo "$inputBySys") | sort -t, -k2,2nr -k1,1)"
maxFreq="$(head -n1 <<< "$inputBySysFreq" | cut -d, -f2)"
lineCount="$(wc -l <<< "$inputBySysFreq")"
increment="$(awk '{l=log($1/$2)/log(2); l=int(l)-(int(l)>l); print 2^l}' <<< "$maxFreq $lineCount")"
seq="$({ echo obase=2; seq 0 "$increment" "$maxFreq" | head -n-1; } | bc |
awk -F. '{sub(/0*$/,"",$2); print 0+$1 "," $2 "," length($2)}' |
sort -snt, -k3,3 -k2,2 | head -n "$lineCount")"
paste -d, <(echo "$seq") <(echo "$inputBySysFreq") | sort -nt, -k1,1 -k2,2 | cut -d, -f4,6
This solution could fail for very long input files due to the limited precision of floating point numbers in seq and awk.

Well, this is what I've come up with:
args=()
while IFS=' ' read -r _ name; do
# add a file redirection with grepped certain SYSTEM only for later eval
args+=("<(grep '^$name,' file)")
done < <(
# extract SYSTEM only
<file cut -d, -f1 |
#sort with the count
sort | uniq -c | sort -nr
)
# this is actually safe, because we control all arguments
eval paste -d "'\\n'" "${args[#]}" |
# paste will insert empty lines when the list ended - remove them
sed '/^$/d'
First, I extract and sort the SYSTEM names in the order which occurs the most often to be first. So for the input example we get:
4 beta
3 alpha
1 gamme
Then for each such name I add the proper string <(grep '...' file) to arguments list witch will be later evalulated.
Then I evalulate the call to paste <(grep ...) <(grep ...) <(grep ...) ... with newline as the paste's delimeter. I remove empty lines with simple sed call.
The output for the input provided:
beta,21700055
alpha,90198500
gamma,64910850
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
Converted to a fancy oneliner, with substituting the while read with command substitution and sed. Got safe with inputfile naming with printf "%q" "$inputfile" and double quoting inside sed regex.
inputfile="file"
fieldsep=","
eval paste -d '"\\n"' "$(
cut -d "$fieldsep" -f1 "$inputfile" |
sort | uniq -c | sort -nr |
sed 's/^[[:space:]]*[0-9]\+[[:space:]]*\(.*\)$/<(grep '\''^\1'"$fieldsep"\'' "'"$(printf "%q" "$inputfile")"'")/' |
tr '\n' ' '
)" |
sed '/^$/d'

inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
}
To accommodate for strange input values this script implements somewhat simple but hardy in bash statemachine.
The variable counts stores SYSTEM names with their're occurrence count. So from the example input it will be
4 alpha
3 beta
1 gamma
Now - we output the SYSTEM name with the biggest occurrence count that is also different from the last outputted SYSTEM name. We decrement it's occurrence count. If the count is equal to zero, it is removed from the list. We remember the last outputted SYSTEM name. We repeat this process until all occurrence counts reach zero, so the list is empty. For the example input this will output:
beta
alpha
beta
alpha
beta
alpha
beta
gamma
Now, we need to join that list with the job names. We can't use join as the input is not sorted and we don't want to change the ordering. So what I do, I get only SYSTEM names in system. Then for each system I open a different file descriptor with filtered only that SYSTEM name from the input file. All the file descriptors are stored in an array. Then for each SYSTEM name from the input, I find the file descriptor that filters that SYSTEM name from the input file and read exactly one line from the file descriptor. This works like an array of file positions each file position associated / filtering specified SYSTEM name.
beta,21700055
alpha,90198500
beta,33452909
alpha,93082105
beta,40850198
alpha,30184438
beta,82645731
gamma,64910850
The script was done so for the input in the form of:
alpha,90198500
alpha,93082105
alpha,30184438
beta,21700055
gamma,64910850
the script outputs correctly:
alpha,90198500
gamma,64910850
alpha,93082105
beta,21700055
alpha,30184438
I think this algorithm will mostly always print correct output, but the ordering is so that the least common SYSTEMs will be outputted last, which may be not optimal.
Tested manually with some custom tests and checker on paiza.io.
inputfile="inputfile"
in=( 1 2 1 5 )
cat <<EOF > "$inputfile"
$(seq ${in[0]} | sed 's/^/A,/' )
$(seq ${in[1]} | sed 's/^/B,/' )
$(seq ${in[2]} | sed 's/^/C,/' )
$(seq ${in[3]} | sed 's/^/D,/' )
EOF
sed -i -e '/^$/d' "$inputfile"
inputfile="inputfile"
fieldsep=","
# remember SYSTEMS with it's occurrence counts
counts=$(cut -d "$fieldsep" -f1 "$inputfile" | sort | uniq -c)
# I think this holds true
# The SYSTEM with the most count should be lower than the sum of all others
# remember last outputted system name
lastsys=''
# until there are any systems with counts
while ((${#counts})); do
# get the most occurrented system with it's count from counts
IFS=' ' read -r cnt sys < <(
# if lastsys is empty, don't do anything, if not, filter it out
if [ -n "$lastsys" ]; then
grep -v " $lastsys$";
else
cat;
# ha suprise - counts is here!
# probably would be way more readable with just `printf "%s" "$counts" |`
fi <<<"$counts" |
# with the most occurence
sort -n | tail -n1
)
if [ -z "$cnt" ]; then
echo "ERROR: constructing output is not possible! There have to be duplicate system lines!" >&2
exit 1
fi
# update counts - decrement the count of this system, or remove it if count is 1
counts=$(
# remove current system from counts
<<<"$counts" grep -v " $sys$"
# if the count of the system is 1, don't add it back - it's count is now 0
if ((cnt > 1)); then
# decrement count and add the line with system to counts
printf "%s" "$((cnt - 1)) $sys"
fi
)
# finally print output
printf "%s\n" "$sys"
# and remember last system
lastsys="$sys"
done |
{
# get system names only in `system` - using cached counts variable
# for each system name open a grep for that name from the input file
# with asigned file descritpro
# The file descriptor list is saved in an array `fds`
fds=()
systems=""
while IFS=' ' read -r _ sys; do
exec {fd}< <(grep "^$sys," "$inputfile")
fds+=("$fd")
systems+="$sys"$'\n'
done <<<"$counts"
# for each line in input
while IFS='' read -r sys; do
# get the position inside systems list of that system decremented by 1
# this will be the underlying filesystem for filtering that system out of input
fds_idx=$(<<<"$systems" grep -n "$sys" | cut -d: -f1)
fds_idx=$((fds_idx - 1))
# read one line from that file descriptor
# I wonder is `sed 1p` would be faster
IFS='' read -r -u "${fds[$fds_idx]}" line
# output that line
printf "%s\n" "$line"
done
} |
{
# check if the output is correct
output=$(cat)
# output should have same lines as inputfile
if ! cmp <(sort "$inputfile") <(<<<"$output" sort); then
echo "Output does not match input!" >&2
exit 1
fi
# two consecutive lines can't have the same system
lastsys=""
<<<"$output" cut -d, -f1 |
while IFS= read -r sys; do
if [ -n "$lastsys" -a "$lastsys" = "$sys" ]; then
echo "Same systems found on two consecutive lines!" >&2
exit 1
fi
lastsys="$sys"
done
# all ok
echo "all ok!"
echo -------------
printf "%s\n" "$output"
}
exit

sort fasta by sequence size

I currently want to sort a hudge fasta file (+10**8 lines and sequences) by sequence size. fasta is a clear defined format in biology use to store sequence (genetic or proteic):
>id1
sequence 1 # could be on several line
>id2
sequence 2
...
I have run a tools that give me in tsv format:
the Identifiant, the length, and the position in bytes of the identifiant.
for now what I am doing is to sort this file by the length column then I parse this file and use seek to retrieve the corresponding sequence then append it to a new file.
# this fonction will get the sequence using seek
def get_seq(file, bites):
with open(file) as f_:
f_.seek(bites, 0) # go to the line of interest
line = f_.readline().strip() # this line is the begin of the
#sequence
to_return = "" # init the string which will contains the sequence
while not line.startswith('>') or not line: # while we do not
# encounter another identifiant
to_return += line
line = f_.readline().strip()
return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):
with open(out_file, 'a') as out_file:
out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))
# main loop will parse the index file and call the function defined below
with open(args.fai) as ref:
indice = 0
for line in ref:
spt = line.split()
id_ = spt[0]
seq = get_seq(args.i, int(spt[2]))
write_seq(out_file=args.out, id_=id_, sequence=seq)
my problems is the following is really slow does it is normal (it takes several days)? Do I have another way to do it? I am a not a pure informaticien so I may miss some point but I was believing to index files and use seek was the fatest way to achive this am I wrong?

Seems like opening two files for each sequence is probably contibuting to a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established fasta parser/indexer like biopython or samtools. Here's an (untested) solution with samtools:
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref:
for line in ref:
spt = line.split()
id_ = spt[0]
subprocess.call(["samtools", "faidx", args.i, id_, ">>", args.out], shell=True)

What about bash and some basic unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use index file, but nevertheless may run faster.
csplit -z -f tmp_fasta_file_ $1 '/>/' '{*}'
for file in tmp_fasta_file_*
do
TMP_FASTA_WC=$(wc -l < $file | tr -d ' ')
FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
cat "$filename" >> $2
done
rm tmp_fasta_file*
First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta

Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "#"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line numbers are not coherent, it may be because the IIIII... string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r option to fastq-sort.
Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "#"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.

You can also do it very conveniently with awk, check the code below:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"
This and other methods have been posted in Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such like this one.

saving information as a variable in the middle of a series of pipes

I am trying to save some information about a intermediate file in a series of pipes. This information can be gleaned from the first few dozen lines, so ideally I would like to not process the entire thing twice.
The end result I want is a integer value stored as a variable that the script can then make use of for the next step.
What I have so far is
samtools bamshuf -Ou test.bam tmp | tee >($(eval READ_LEN=$(awk '{print length($10)}' | head -100 | sort -u))) | samtools bam2fq - | gzip -f > $OUT
Where I would like READ_LEN to contain the first sorted length of the first 100 lines of column 10 in the input file.
When I run this, I get no errors, but READ_LEN is not set. I assume this is because of the use of eval, and so stdout is not being piped on to awk.
How can I save information into a variable like this in the middle of a series of pipes?

The variable READ_LEN is set in a sub-shell (because it is included in the "$(...). When the sub-shell exits the variable is destroyed. Capture the value and set the variable in the parent shell. Something like
while read value; do
[ -n "$READ_LEN" ] && READLEN+=" "
READ_LEN+=$value
done < <(samtools bamshuf -Ou test.bam tmp | awk '{print length($10)}' | head -100 | sort -u)
Then use READ_LEN for the remainder of the processing

How to select random lines from a file

I have a text file containing 10 hundreds of lines, with different lengths. Now I want to select N lines randomly, save them in another file, and remove them from the original file.
I've found some answers to this question, but most of them use a simple idea: sort the file and select first or last N lines. unfortunately this idea doesn't work to me, because I want to preserve the order of lines.
I tried this piece of code, but it's very slow and takes hours.
FILEsrc=$1;
FILEtrg=$2;
MaxLines=$3;
let LineIndex=1;
while [ "$LineIndex" -le "$MaxLines" ]
do
# count number of lines
NUM=$(wc -l $FILEsrc | sed 's/[ \r\t].*$//g');
let X=(${RANDOM} % ${NUM} + 1);
echo $X;
sed -n ${X}p ${FILEsrc}>>$FILEtrg; #write selected line into target file
sed -i -e ${X}d ${FILEsrc}; #remove selected line from source file
LineIndex=`expr $LineIndex + 1`;
done
I found this line the most time consuming one in the code:
sed -i -e ${X}d ${FILEsrc};
is there any way to overcome this problem and make the code faster?
Since I'm in hurry, may I ask you to send me complete c/c++ code for doing this?

A simple O(n) algorithm is described in:
http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done

Generate all your offsets, then make a single pass through the file. Assuming you have the desired number of offsets in offsets (one number per line) you can generate a single sed script like this:
sed "s!.*!&{w $FILEtrg\nd;}!" offsets
The output is a sed script which you can save to a temporary file, or (if your sed dialect supports it) pipe to a second sed instance:
... | sed -i -f - "$FILEsrc"
Generating the offsets file left as an exercise.
Given that you have the Linux tag, this should work right off the bat. The default sed on some other platforms may not understand \n and/or accept -f - to read the script from standard input.
Here is a complete script, updated to use shuf (thanks #Thor!) to avoid possible duplicate random numbers.
#!/bin/sh
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
# Add a line number to each input line
nl -ba "$FILEsrc" |
# Rearrange lines
shuf |
# Pick out the line number from the first $MaxLines ones into sed script
sed "1,${MaxLines}s!^ *\([1-9][0-9]*\).*!\1{w $FILEtrg\nd;}!;t;D;q" |
# Run the generated sed script on the original input file
sed -i -f - "$FILEsrc"

[I've updated each solution to remove selected lines from the input, but I'm not positive the awk is correct. I'm partial to the bash solution myself, so I'm not going to spend any time debugging it. Feel free to edit any mistakes.]
Here's a simple awk script (the probabilities are simpler to manage with floating point numbers, which don't mix well with bash):
tmp=$(mktemp /tmp/XXXXXXXX)
awk -v total=$(wc -l < "$FILEsrc") -v maxLines=$MaxLines '
BEGIN { srand(); }
maxLines==0 { exit; }
{ if (rand() < maxLines/total--) {
print; maxLines--;
} else {
print $0 > /dev/fd/3
}
}' "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"
As you print a line to the output, you decrement maxLines to decrease the probability of choosing further lines. But as you consume the input, you decrease total to increase the probability. In the extreme, the probability hits zero when maxLines does, so you can stop processing the input. In the other extreme, the probability hits 1 once total is less than or equal to maxLines, and you'll be accepting all further lines.
Here's the same algorithm, implemented in (almost) pure bash using integer arithmetic:
FILEsrc=$1
FILEtrg=$2
MaxLines=$3
tmp=$(mktemp /tmp/XXXXXXXX)
total=$(wc -l < "$FILEsrc")
while read -r line && (( MaxLines > 0 )); do
(( MaxLines * 32768 > RANDOM * total-- )) || { printf >&3 "$line\n"; continue; }
(( MaxLines-- ))
printf "$line\n"
done < "$FILEsrc" > "$FILEtrg" 3> $tmp
mv $tmp "$FILEsrc"

Here's a complete Go program :
package main
import (
"bufio"
"fmt"
"log"
"math/rand"
"os"
"sort"
"time"
)
func main() {
N := 10
rand.Seed( time.Now().UTC().UnixNano())
f, err := os.Open(os.Args[1]) // open the file
if err!=nil { // and tell the user if the file wasn't found or readable
log.Fatal(err)
}
r := bufio.NewReader(f)
var lines []string // this will contain all the lines of the file
for {
if line, err := r.ReadString('\n'); err == nil {
lines = append(lines, line)
} else {
break
}
}
nums := make([]int, N) // creates the array of desired line indexes
for i, _ := range nums { // fills the array with random numbers (lower than the number of lines)
nums[i] = rand.Intn(len(lines))
}
sort.Ints(nums) // sorts this array
for _, n := range nums { // let's print the line
fmt.Println(lines[n])
}
}
Provided you put the go file in a directory named randomlines in your GOPATH, you may build it like this :
go build randomlines
And then call it like this :
./randomlines "path_to_my_file"
This will print N (here 10) random lines in your files, but without changing the order. Of course it's near instantaneous even with big files.

Here's an interesting two-pass option with coreutils, sed and awk:
n=5
total=$(wc -l < infile)
seq 1 $total | shuf | head -n $n \
| sed 's/^/NR == /; $! s/$/ ||/' \
| tr '\n' ' ' \
| sed 's/.*/ & { print >> "rndlines" }\n!( &) { print >> "leftover" }/' \
| awk -f - infile
A list of random numbers are passed to sed which generates an awk script. If awk were removed from the pipeline above, this would be the output:
{ if(NR == 14 || NR == 1 || NR == 11 || NR == 20 || NR == 21 ) print > "rndlines"; else print > "leftover" }
So the random lines are saved in rndlines and the rest in leftover.

Mentioned "10 hundreds" lines should sort quite quickly, so this is a nice case for the Decorate, Sort, Undecorate pattern. It actually creates two new files, removing lines from the original one can be simulated by renaming.
Note: head and tail cannot be used instead of awk, because they close the file descriptor after given number of lines, making tee exit thus causing missing data in the .rest file.
FILE=input.txt
SAMPLE=10
SEP=$'\t'
<$FILE nl -s $"SEP" -nln -w1 |
sort -R |
tee \
>(awk "NR > $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.rest) \
>(awk "NR <= $SAMPLE" | sort -t"$SEP" -k1n,1 | cut -d"$SEP" -f2- > $FILE.sample) \
>/dev/null
# check the results
wc -l $FILE*
# 'remove' the lines, if needed
mv $FILE.rest $FILE

This might work for you (GNU sed, sort and seq):
n=10
seq 1 $(sed '$=;d' input_file) |
sort -R |
sed $nq |
sed 's/.*/&{w output_file\nd}/' |
sed -i -f - input_file
Where $n is the number of lines to extract.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Bash: Loop Read N lines at time from CSV - linux

Related

how to loop through string for patterns from linux shell?

Interleave lines sorted by a column

sort fasta by sequence size

saving information as a variable in the middle of a series of pipes

How to select random lines from a file

Categories

Resources