I'm conducting an iterative analysis and have to submit more than 5000 jobs to a batch system on a large computer cluster.
I want to run a bash for loop but refer to both the current list item and the next item in my script. I'm not sure of the best way to do this using this format:
#! /bin/bash
for i in `cat list.txt`;
do
# run a bunch of code on $i (ex. Data file for 'Apples')
# compare input of $i with next in list {$i+1} (ex. Compare 'Apples' to 'Oranges', save output)
# take output of this comparison and use it as an input for the next analysis of $i (ex. analyze 'Apples' some more, save output for the next step, analyze data on 'Oranges')
# save this output as the input for next script which analyses the next item in the list {$i+1} (Analysis of 'Oranges' with input data from 'Apples', and comparing to 'Grapes' in the middle of the loop, etc., etc.)
done
Would it be easiest for me to provide a tabular input list in a while loop? I would really prefer not to do this as I would have to do some code editing, albeit minor.
Thanks for helping a novice -- I've looked all over the interwebs and run through a bunch of books and haven't found a good way to do this.
EDIT: For some reason I was thinking there might have been a for loop trick to do this but I guess not; it's probably easier for me to do a while loop with a tabular input. I was prepared to do this, but I didn't want to re-write the code I had already.
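For the record, the kind of tabular input I had in mind would look roughly like this (a rough sketch; the echo stands in for the real analysis steps):
paste list.txt <(tail -n +2 list.txt) > pairs.txt
while read -r curr next; do
    # $curr is the current item, $next is the following one (empty on the last line)
    echo "would analyze $curr and compare it with $next"
done < pairs.txt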
UPDATE: Thank you all so much for your time and input! Greatly appreciated.
Another solution is to use bash arrays. For example, given a file list.txt with content:
1
2
3
4
4
5
You can create an array variable with the lines of the file as elements, like so:
$ myarray=(1 2 3 4 4 5)
While you could also do myarray=( $(cat list.txt) ), that would split on whitespace and expand any glob characters, so the better method is:
$ IFS=$'\n' read -r -d '' -a myarray < list.txt
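If your bash is version 4 or newer, mapfile gives you the same result in one line:
$ mapfile -t myarray < list.txt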
Then you can access elements as:
$ echo "${myarray[2]}"
3
The length of the array is given by ${#myarray[@]}. A list of all indices is given by ${!myarray[@]}, and you can loop over this list of indices:
for i in "${!myarray[@]}"; do
echo "${myarray[$i]} ${myarray[$(( $i + 1))]}"
done
Output:
1 2
2 3
3 4
4 4
4 5
5
While there are likely simpler solutions to your particular use case, this would allow you to access arbitrary combinations of array elements in the loop.
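If the trailing unpaired line (the lone 5 above) is not wanted, a bounds check on the index avoids it; a minimal sketch, assuming myarray is populated as above:
for i in "${!myarray[@]}"; do
    if (( i + 1 < ${#myarray[@]} )); then
        echo "${myarray[i]} ${myarray[i+1]}"
    else
        echo "${myarray[i]} (no next item)"
    fi
done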
This answer assumes that you want your values to overlap -- meaning that a value given as next then becomes curr on the following iteration.
Assuming you encapsulate your code in a function that takes two arguments (current and next) when a next item exists, or one argument when on the last item:
# a "higher-order function"; it takes another function as its argument
# and calls that argument with current/next input pairs.
invokeWithNext() {
local funcName=$1
local curr next
read -r curr
while read -r next; do
"$funcName" "$curr" "$next"
curr=$next
done
"$funcName" "$curr"
}
# replace this with your own logic
yourProcess() {
local curr=$1 next=$2
if (( $# > 1 )); then
printf 'Current value is %q, and next item is %q\n' "$curr" "$next"
else
printf 'Current value is %q; no next item exists\n' "$curr"
fi
}
These definitions done, you can run:
invokeWithNext yourProcess <list.txt
...to yield output such as:
Current value is 1, and next item is 2
Current value is 2, and next item is 3
Current value is 3, and next item is 4
Current value is 4, and next item is 5
Current value is 5; no next item exists
$ printf '%d\n' {0..10} | paste - -
0 1
2 3
4 5
6 7
8 9
10
So if you just want to interpolate lines so that you can read two variables per line...
while read -r odd even; do
…
done < <(paste - - < inputfile)
You will need to do additional work if your lines contain whitespace.
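One way to handle that, as a sketch: paste joins the two lines with a tab by default, so telling read to split only on tabs preserves any spaces inside each original line:
while IFS=$'\t' read -r odd even; do
    # $odd and $even each hold one original line, spaces intact
    printf 'odd=%s even=%s\n' "$odd" "$even"
done < <(paste - - < inputfile)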
I would replace the for loop with a while read xx loop.
Something along the lines of
cat list.txt | while read line; do
if read nextline; then
# You have $line and $nextline
else
# No more lines: $nextline is empty and $line holds the last line of list.txt
fi
done
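Note that this consumes list.txt two lines at a time, so the pairs do not overlap (1/2, then 3/4, and so on). If overlapping pairs are needed, a variant along these lines could be used instead (a sketch):
{
    read -r line
    while read -r nextline; do
        # process $line together with $nextline here
        echo "$line -> $nextline"
        line=$nextline
    done
    # $line now holds the final item, which has no successor
    echo "$line -> (none)"
} < list.txt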
Related
Question:
I have 2 files, file 1 is a TSV (BED) file that has 23 base-pair sequences in column 7, for example:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC
File 2 is a FASTA file (hg19.fasta) that looks like this. Although it breaks across the lines, this run of A, C, G, and T's represents one continuous string (i.e. a chromosome). This file is the entire human reference genome build 19, so the pattern of a > header followed by a sequence essentially occurs 23 times, once for each of the 23 chromosomes:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
CAGAATGTACAGGTTTGTTACACAGGTATACACCTGCCATGGTTGTTTGCTGCACCCATCAACTCACCATCTACATTAGG
TATTTCTCCTAACGTTATCCCTCATGAATAAGTCAAGTAAATGGAC
>2 dna:chromosome chromosome:GRCh37:1:1:2492300:1
AATTTGACCAGAAGTTATGGGCATCCCTCCCCTGGGAAGGAGGCAGGCAGAAAAGTTTGGAATCTATGTAGTAAAATATG
TTACTCTTTTATATATATGAATAAGTCAAGTAAATGGACATACATATATGTGTGTATATGTGTATATATATATACACACA
TATATACATACATACATACATACATATTATCTGAATTAGGCCATGGTGCTTTGTTATGGCAGCTCTCTGGGATACATGTG
I want to 1) find out how many times each 23bp sequence appears in the second file, without overlapping any others and including sequences that break across the lines, and 2) append this number as a new column next to the sequence, so the new file looks like this:
Desired Output:
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
My attempt:
I imagine solving the first part will be some variation on grep; so far I've managed:
grep -o ATGGTGCTTTGTTATGGCAGCTC "$file_2" | grep -c ""
which gets the count of a specific sequence, but not each sequence in the column. I think appending the grep results will require awk and paste but I haven't gotten that far!
Any help is appreciated as always! =)
Updates and Edits:
The actual size of these files is relatively massive (30 MB or ~500,000 lines for each tsv/BED file) and the FASTA file is the entire human reference genome build 19, which is ~60,000,000 lines. The Perl solution proposed by @choroba works, but doesn't scale well to these sizes.
Unfortunately, because of the need to identify matches across the lines, the awk and bash/grep solutions mentioned below won't work.
I want multiple non-overlapping hits in the same chromosome to count as the actual number of hits. I.e. If you search for a sequence and get 2 hits in a single chromosome and 1 in another chromosome, the total count should be 3.
Ted Lyngmo is very kindly helping me develop a solution in C++ that allows this to be run in a realistic timeframe; there's more detail in his post in this thread, and the link to the GitHub repo for it is here =)
If the second file is significantly bigger than the first one, I would try this awk script:
awk 'v==1 {a[$7];next} # Get the pattern from first file into the array a
v==2 { # For each line of the second file
for(i in a){ # Loop though all patterns
a[i]+=split($0,b,i)-1 # Accumulate the number of pattern matches in the line
}
}
v==3 {print $0,a[$7]} # Re-read first file to add the number of pattern matches
' v=1 file1 v=2 file2 v=3 file1
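The split(...)-1 trick works because split returns one more piece than the number of separator matches, so pieces minus one is the match count for that line; for example:
$ echo "xxATGxxATGxx" | awk '{ print split($0, b, "ATG") - 1 }'
2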
I'd reach for a programming language like Perl.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta_file, $bed_file) = @ARGV;
open my $fasta, '<', $fasta_file or die "$fasta_file: $!";
open my $bed, '<', $bed_file or die "$bed_file: $!";
my $seq;
while (<$fasta>) {
$seq .= "\n", next if /^>/;
chomp;
$seq .= $_;
}
while (<$bed>) {
chomp;
my $short_seq = (split /\t/, $_)[-1];
my $count = () = $seq =~ /\Q$short_seq\E/g;
print "$_\t$count\n";
}
To count overlapping sequences, change the regex to a lookahead.
my $count = () = $seq =~ /(?=\Q$short_seq\E)/g;
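Saved as, say, count_seqs.pl (the filename here is just a placeholder), it would be run with the FASTA file first and the BED/TSV file second, since that is the order @ARGV is unpacked in, redirecting stdout to the new file:
perl count_seqs.pl hg19.fasta file1.bed > file1_with_counts.bed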
Since grep -c seems to give you the correct count (matching lines, not counting multiple occurrences on the same line), you could read the 7 fields from the TSV (BED) file and just print them again with the grep output added to the end:
#!/bin/bash
# read the fields into the array `v`:
while read -ra v
do
# print the 7 first elements in the array + the output from `grep -c`:
echo "${v[#]:0:7}" "$(grep -Fc "${v[6]}" hg19.fasta)"
done < tsv.bed > outfile
outfile will now contain
1 779692 779715 Sample_3 + 1 ATGGTGCTTTGTTATGGCAGCTC 1
1 783462 783485 Sample_4 - 1 ATGAATAAGTCAAGTAAATGGAC 2
Benchmarks
This table compares the three different solutions presented as answers here, with timings for processing different numbers of tsv/bed records against the full hg19.fa file (excluding the records containing only Ns); hg19.fa contains 57'946'726 such records. As a baseline I've used two versions of a C++ program (called hgsearch/hgsearchmm). hgsearch reads the whole hg19.fa file into memory and then searches it in parallel. hgsearchmm uses a memory-mapped file instead and then searches that (also in parallel).
search \ beds    1           2            100          1000         10000
awk              1m0.606s    17m19.899s   -            -            -
perl             13.263s     15.618s      4m48.751s    48m27.267s   -
bash/grep        2.088s      3.670s       3m27.378s    34m41.129s   -
hgsearch         8.776s      9.425s       30.619s      3m56.508s    38m43.984s
hgsearchmm       1.942s      2.146s       21.715s      3m28.265s    34m56.783s
The tests were run on an Intel Core i9 (12 cores/24 hyperthreads) in WSL/Ubuntu 20.04 (SSD disk).
The sources for the scripts and baseline programs used can be found here
Is there a way to remove both duplicates and redundant substrings from a list, using shell tools? By "redundant", I mean a string that is contained within another string, so "foo" is redundant with "foobar" and "barfoo".
For example, take this list:
abcd
abc
abd
abcd
bcd
and return:
abcd
abd
uniq, sort -u and awk '!seen[$0]++' remove duplicates effectively but not redundant strings:
How to delete duplicate lines in a file without sorting it in Unix?
Remove duplicate lines without sorting
I can loop through each line recursively with grep but this is quite slow for large files. (I have about 10^8 lines to process.)
There's an approach using a loop in Python here: Remove redundant strings based on partial strings, and in Bash here: How to check if a string contains a substring in Bash, but I'm trying to avoid loops. Edit: I mean nested loops here; thanks for the clarification @shellter.
Is there a way to use awk's match() function with an array index? This approach would build the array progressively, so it never has to search the whole file and should be faster for large files. Or am I missing some other simple solution?
An ideal solution would allow matching of a specified column, as for the methods above.
EDIT
Both of the answers below work, thanks very much for the help. Currently testing for performance on a real dataset, will update with results and accept an answer. I tested both approaches on the same input file, which has 430,000 lines, of which 417,000 are non-redundant. For reference, my original looped grep approach took 7h30m with this file.
Update:
James Brown's original solution took 3h15m and Ed Morton's took 8h59m. On a smaller dataset, James's updated version was 7m versus the original's 20m. Thank you both, this is really helpful.
The data I'm working with are around 110 characters per string, with typically hundreds of thousands of lines per file. The way in which these strings (which are antibody protein sequences) are created can lead to characters from one or both ends of the string getting lost. Hence, "bcd" is likely to be a fragment of "abcde".
An awk script that, on the first pass, extracts and stores all strings and their substrings in two arrays, strs and subs, and then checks against them on the second pass:
$ awk '
NR==FNR { # first run
if(($0 in strs)||($0 in subs)) # process only unseen strings
next
len=length()-1 # initial substring length
strs[$0] # hash the complete strings
while(len>=1) {
for(i=1;i+len-1<=length();i++) { # get all substrings of current len
asub=substr($0,i,len) # sub was already reserved (awk builtin) :(
if(asub in strs) # if substring is in strs
delete strs[asub] # we do not want it there
subs[asub] # hash all substrings too
}
len--
}
next
}
($0 in strs)&&++strs[$0]==1' file file
Output:
abcd
abd
I tested the script with about 30 M records of 1-20 char ACGT strings. The script ran in 3m27s and used about 20 % of my 16 GB. Using strings of length 1-100 I OOM'd in a few minutes (tried it again with about 400k records of length 50-100 and it used about 200 GB and ran for about an hour). (20 M records of 1-30 chars ran in 7m10s and used 80 % of the memory.)
So if your data records are short or you have unlimited memory, my solution is fast but in the opposite case it's going to crash running out of memory.
Edit:
Another version that tries to conserve memory. On the first pass it checks the min and max lengths of the strings, and on the second pass it won't store substrings shorter than the global min. For about 400k records of length 50-100 it used around 40 GB and ran 7 minutes. My random data didn't have any redundancy, so input==output. It did remove redundancy with other datasets (2 M records of 1-20 char strings):
$ awk '
BEGIN {
while((getline < ARGV[1])>0) # 1st run, check min and max lengths
if(length()<min||min=="") # TODO: test for length()>0, too
min=length()
else if(length()>max||max=="")
max=length()
# print min,max > "/dev/stderr" # debug
close(ARGV[1])
while((getline < ARGV[1])>0) { # 2nd run, hash strings and substrings
# if(++nr%10000==0) # debug
# print nr > "/dev/stderr" # debug
if(($0 in strs)||($0 in subs))
continue
len=length()-1
strs[$0]
while(len>=min) {
for(i=1;i+len-1<=length();i++) {
asub=substr($0,i,len)
if(asub in strs)
delete strs[asub]
subs[asub]
}
len--
}
}
close(ARGV[1])
while((getline < ARGV[1])>0) # 3rd run, output
if(($0 in strs)&&!strs[$0]++)
print
}' file
$ awk '{print length($0), $0}' file |
sort -k1,1rn -k2 -u |
awk '!index(str,$2){str = str FS $2; print $2}'
abcd
abd
The above assumes the set of unique values will fit in memory.
EDIT
This won't work. Sorry.
@Ed's solution is the best idea I can imagine without some explicit looping, and even that is implicitly scanning over the near-entire growing history of data on every record. It has to.
Can your existing resources hold that whole column in memory, plus a delimiter per record? If not, then you're going to be stuck with either very complex optimization algorithms, or VERY slow redundant searches.
Original post left for reference in case it gives someone else an inspiration.
That's a lot of data.
Given the input file as-is,
while read next
do [[ "$last" == "$next" ]] && continue # throw out repeats
[[ "$last" =~ $next ]] && continue # throw out sustrings
[[ "$next" =~ $last ]] && { last="$next"; continue; } # upgrade if last a substring of next
echo $last # distinct string
last="$next" # set new key
done < file
yields
abcd
abd
With a file of that size I wouldn't trust that sort order, though. Sorting is going to be very slow and take a lot of resources, but will give you more trustworthy results. If you can sort the file once and use that output as the input file, great. If not, replace that last line with done < <( sort -u file ) or something to that effect.
Reworking this logic in awk will be faster.
$: sort -u file | awk '1==NR{last=$0} last~$0{next} $0~last{last=$0;next} {print last;last=$0}'
Aside from the sort this uses trivial memory and should be very fast and efficient, for some value of "fast" on a file with 10^8 lines.
I'm currently trying to write a function in my bash script that does the following: Take in a file as an argument and calculate the sum of the numbers within that file. I must make use of a for loop and the bc command.
Example of values in the file (each value on their own line):
12
4
53
19
6
So here's what I have so far:
function sum_values() {
for line in $1; do
#not sure how to sum the values using bc
done
}
Am I on the right track? I'm not sure how to implement the bc command in this situation.
You can do it easily without the need of a for loop.
paste -s -d+ numbers.txt | bc
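Using the example values from the question (assuming they are stored in numbers.txt), the intermediate and final results would be:
$ paste -s -d+ numbers.txt
12+4+53+19+6
$ paste -s -d+ numbers.txt | bc
94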
You are not on track. Why?
You are passing the whole file content as a variable, which requires storing the whole file in memory. Not a problem with a 1, 2, 3 example, but a big no-go in real life.
You are iterating over the content of a file using a for in loop, assuming that you are iterating over the lines of that file. That is not true, because word splitting will be performed, which makes the for in loop literally iterate over words, not lines.
As others mentioned, the shell is not the right tool for this. That's because this kind of processing is very slow in the shell compared to, for example, awk. Furthermore, bash arithmetic cannot handle floating point, meaning you can only process integers. Use awk.
Correct would be (with bash, for educational purposes):
# Expects a filename
sum() {
filename=${1}
s=0
while read -r line ; do
# Arithmetic expansion
s=$((s+line))
# Or with bc
# s=$(bc <<< "${s}+${line}")
# With floating point support
# s=$(bc -l <<< "${s}+${line}")
done < "${filename}"
echo "${s}"
}
sum filename
With awk:
awk '{s+=$0}END{print s}' filename
While awk (or other higher level language: perl, python, etc) would be better suited for this task, you are on the right track for doing it the naive way. Tip:
$ x=1
$ y=$(bc <<<"$x + 1")
$ echo $y
2
To do math in bash we surround an operation in $(( ... ))
Here are some examples:
$(( 5 + 5 )) # 10
my_var=$((5 + 5)) # my_var is now 10
my_var=$(($my_var + 5)) # my_var is now 15
Solution to your problem:
function sum_values() {
sum=0
for i in $(<$1); do
sum=$(($sum + $i))
done
echo $sum
}
Note that you could have also done $(cat $1) instead of $(<$1) in the solution above.
Edit: Replaced return $sum with echo $sum
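Presumably that edit was made because return can only carry an exit status (0-255), whereas echo lets the caller capture the result via command substitution, e.g. (numbers.txt being a placeholder filename):
total=$(sum_values numbers.txt)
echo "The sum is $total"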
I have a list file with many different columns, and want to know whether the next value in the third column is equal to the current value in the third column. My problem is that those values are paths, looking like this:
/media/user/DATA/Folder1/File1.extension
/media/user/DATA/Folder1/File1.extension
/media/user/DATA/Folder1/File2.extension
/media/user/DATA/Folder1/File2.extension
/media/user/DATA/Folder2/File3.extension
/media/user/DATA/Folder2/File4.extension
/media/user/DATA/Folder2/File5.extension
/media/user/DATA/Folder3/File6.extension
/media/user/DATA/Folder4/File6.extension
I have files with the same name in the same folder (which should be considered as an equal value), same name in different folders (must be recognized as different values) and same folder but different names (different values).
What I had been trying to do was to sort them all (with sort -k 3,3), then make a fourth column saying whether the next value is the same as the one in the column or not, looking something like this:
/media/user/DATA/Folder1/File1.extension Y
/media/user/DATA/Folder1/File1.extension N
/media/user/DATA/Folder1/File2.extension Y
/media/user/DATA/Folder1/File2.extension N
/media/user/DATA/Folder2/File3.extension N
/media/user/DATA/Folder2/File4.extension N
/media/user/DATA/Folder2/File5.extension N
/media/user/DATA/Folder3/File6.extension N
/media/user/DATA/Folder4/File6.extension N
I came up with this code to do that, but it keeps showing errors when comparing R1 with LastR1, which I think may be due to the "/" in the strings...
echo "N" > LastR1.kp
sort -k3,3 list > list.tmp
mv list list.backup
mv list.tmp list
while read T1 T2 R1; do
LastR1=`cat LastR1.kp`
if [ $R1 == $LastR1 ]
then
KeepR1="Y"
else
KeepR1="N"
fi
echo "$KeepR1" > CKeep.kp
cat KeepList.kp CKeep.kp > KeepList.kp
done
sed -i -e 1,2d KeepList
join list KeepList > list.tmp
mv list.tmp list
So, what I would have at the end, would be my original file, with a fourth column with either Y (for 'next row has the same value in 3rd column) or N (for 'different value on the next row').
I can't seem to find the reason why my if statement doesn't work - and, even though I think this approach could do what I need, I'm definitely open to different approaches
Not sure if you are using bash or sh. Add a shebang line at the top of your script:
#!/usr/bin/env bash
and change the if condition to use [[ ]]:
if [[ $R1 == $LastR1 ]]
I hope this solves your problem.
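For what it's worth, here is a quick illustration (with hypothetical values) of why the single-bracket test tends to fail in this script: with [ ], an empty unquoted variable simply vanishes before the test runs, while [[ ]] copes with it:
R1=""
LastR1="/media/user/DATA/Folder1/File1.extension"
[ $R1 == $LastR1 ]     # expands to [ == /media/user/DATA/Folder1/File1.extension ] -> "unary operator expected"
[[ $R1 == $LastR1 ]]   # simply evaluates to false, no error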
I am trying to write a shell script to average several identically formatted files with names file1, file2, file3 and so on.
In each file, the data is in a table with, for example, 4 columns and 5 rows of data. Let's assume file1, file2 and file3 are in the same directory. What I want to do is create an average file, which has the same format as file1/file2/file3 and holds the average of each element from the table. For example,
{(Element in row 1, column 1 in file1) +
(Element in row 1, column 1 in file2) +
(Element in row 1, column 1 in file3)} / 3 >>
(Element in row 1, column 1 in average file)
Likewise, I need to do this for each element in the table; the average file has the same number of elements as file1, file2 and file3.
I tried to write a shell script, but it doesn't work. What I want is to read the files in a loop and grep the same element from each file, add them and average them over the number of files and finally write to a similar file format. This is what I tried to write:
#!/bin/bash
s=0
for i in {1..5..1} do
for j in {1..4..1} do
for f in m* do
a=$(awk 'FNR == i {print $j}' $f)
echo $a
s=$s+$a
echo $f
done
avg=$s/3
echo $avg > output
done
done
This is a rather inefficient way of going about it: for every single number you're trying to extract, you process one of the input files completely – even though you only have three files, you end up running awk 60 times (5 rows × 4 columns × 3 files)!
Also, mixing Bash and awk in this way is a massive antipattern. This here is a great Q&A explaining why.
A few more remarks:
For brace expansion, the default step size is 1, so {1..4..1} is the same as {1..4}.
Awk has no clue what i and j are. As far as it is concerned, those were never defined. If you really wanted to get your shell variables into awk, you could do
a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' $f)
but the approach is not sound anyway.
Shell arithmetic does not work like s=$s+$a or avg=$s/3 – these are just concatenating strings. To have the shell do calculations for you, you'd need arithmetic expansion:
s=$(( s + a ))
or, a little shorter,
(( s += a ))
and
avg=$(( s / 3 ))
Notice that you don't need the $ signs in an arithmetic context.
echo $avg > output truncates the output file on every iteration, so only the last value would survive (and even with >> you would get one number per line rather than the original table layout), which is probably not what you want.
Indentation matters! If not for the machine, then for human readers.
A Bash solution
This solves the problem using just Bash. It is hard coded to three files, but flexible in the number of lines and elements per line. There are no checks to make sure that the number of elements is the same for all lines and files.
Notice that Bash is not fast at that kind of thing and should only be used for small files, if at all. Also, it uses integer arithmetic, so the "average" of 3 and 4 would become 3.
I've added comments to explain what happens.
#!/bin/bash
# Read a line from the first file into array arr1
while read -a arr1; do
# Read a line from the second file at file descriptor 3 into array arr2
read -a arr2 <&3
# Read a line from the third file at file descriptor 4 into array arr3
read -a arr3 <&4
# Loop over elements
for (( i = 0; i < ${#arr1[@]}; ++i )); do
# Calculate average of element across files, assign to res array
res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
done
# Print res array
echo "${res[#]}"
# Read from files supplied as arguments
# Input for the second and third file is redirected to file descriptors 3 and 4
# to enable looping over multiple files concurrently
done < "$1" 3< "$2" 4< "$3"
This has to be called like
./bashsolution file1 file2 file3
and output can be redirected as desired.
An awk solution
This is a solution in pure awk. It's a bit more flexible in that it takes the average of however many files are supplied as arguments; it should also be faster than the Bash solution by about an order of magnitude.
#!/usr/bin/awk -f
# Count number of files: increment on the first line of each new file
FNR == 1 { ++nfiles }
{
# (Pseudo) 2D array summing up fields across files
for (i = 1; i <= NF; ++i) {
values[FNR, i] += $i
}
}
END {
# Loop over lines of array with sums
for (i = 1; i <= FNR; ++i) {
# Loop over fields of current line in array of sums
for (j = 1; j <= NF; ++j) {
# Build record with averages
$j = values[i, j]/nfiles
}
print
}
}
It has to be called like
./awksolution file1 file2 file3
and, as mentioned, there is no limit to the number of files to average over.