Average values across multiple files - linux

I am trying to write a shell script to average several identically formatted files with names file1, file2, file3 and so on.
In each file, the data is in a table with, for example, 4 columns and 5 rows. Let's assume file1, file2 and file3 are in the same directory. What I want to do is to create an average file, which has the same format as file1/file2/file3, where each element is the average of the corresponding elements in the input tables. For example,
{(Element in row 1, column 1 in file1)+
(Element in row 1, column 1 in file2)+
(Element in row 1, column 1 in file3)} >>
(Element in row 1, column 1 in average file)
Likewise for every element in the table, so the average file has the same number of elements as file1, file2 and file3.
I tried to write a shell script, but it doesn't work. What I want is to read the files in a loop and grep the same element from each file, add them and average them over the number of files and finally write to a similar file format. This is what I tried to write:
#!/bin/bash
s=0
for i in {1..5..1} do
for j in {1..4..1} do
for f in m* do
a=$(awk 'FNR == i {print $j}' $f)
echo $a
s=$s+$a
echo $f
done
avg=$s/3
echo $avg > output
done
done

This is a rather inefficient way of going about it: for every single number you're trying to extract, you run awk over one of the input files in its entirety – even though you only have three files, that's 60 complete passes (5 rows × 4 columns × 3 files)!
Also, mixing Bash and awk in this way is a massive antipattern. This here is a great Q&A explaining why.
A few more remarks:
For brace expansion, the default step size is 1, so {1..4..1} is the same as {1..4}.
Awk has no clue what i and j are. As far as it is concerned, those were never defined. If you really wanted to get your shell variables into awk, you could do
a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' $f)
but the approach is not sound anyway.
Shell arithmetic does not work like s=$s+$a or avg=$s/3 – these are just concatenating strings. To have the shell do calculations for you, you'd need arithmetic expansion:
s=$(( s + a ))
or, a little shorter,
(( s += a ))
and
avg=$(( s / 3 ))
Notice that you don't need the $ signs in an arithmetic context.
echo $avg > output truncates the output file on every iteration, so you would end up with only the last value – and even with >>, every number would land on its own line, which is probably not what you want.
Indentation matters! If not for the machine, then for human readers.
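Putting those remarks together, the original nested-loop approach could be patched up as in the sketch below (assuming the three input files are really named file1, file2 and file3 rather than matching m*). It is still very inefficient – one awk invocation per element – so the dedicated solutions below are preferable; it is shown only to illustrate the fixes.
#!/bin/bash
# Sketch: the original approach with the remarks above applied (still inefficient).
for i in {1..5}; do
    for j in {1..4}; do
        s=0
        for f in file1 file2 file3; do
            # Pass the shell loop variables into awk explicitly
            a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' "$f")
            (( s += a ))
        done
        # Integer division, as in the original attempt
        printf '%s ' "$(( s / 3 ))"
    done
    printf '\n'
done > output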
A Bash solution
This solves the problem using just Bash. It is hard coded to three files, but flexible in the number of lines and elements per line. There are no checks to make sure that the number of elements is the same for all lines and files.
Notice that Bash is not fast at this kind of thing and should only be used for small files, if at all. Also, it uses integer arithmetic, so the "average" of 3 and 4 would become 3.
I've added comments to explain what happens.
#!/bin/bash

# Read a line from the first file into array arr1
while read -a arr1; do
    # Read a line from the second file at file descriptor 3 into array arr2
    read -a arr2 <&3
    # Read a line from the third file at file descriptor 4 into array arr3
    read -a arr3 <&4
    # Loop over elements
    for (( i = 0; i < ${#arr1[@]}; ++i )); do
        # Calculate average of element across files, assign to res array
        res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
    done
    # Print res array
    echo "${res[@]}"
# Read from files supplied as arguments
# Input for the second and third file is redirected to file descriptors 3 and 4
# to enable looping over multiple files concurrently
done < "$1" 3< "$2" 4< "$3"
This has to be called like
./bashsolution file1 file2 file3
and output can be redirected as desired.
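For example (the output file name is just an illustration):
./bashsolution file1 file2 file3 > averages.txt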
An awk solution
This is a solution in pure awk. It's a bit more flexible in that it takes the average of however many files are supplied as arguments; it should also be faster than the Bash solution by about an order of magnitude.
#!/usr/bin/awk -f

# Count number of files: increment on the first line of each new file
FNR == 1 { ++nfiles }

{
    # (Pseudo) 2D array summing up fields across files
    for (i = 1; i <= NF; ++i) {
        values[FNR, i] += $i
    }
}

END {
    # Loop over lines of array with sums
    for (i = 1; i <= FNR; ++i) {
        # Loop over fields of current line in array of sums
        for (j = 1; j <= NF; ++j) {
            # Build record with averages
            $j = values[i, j] / nfiles
        }
        print
    }
}
It has to be called like
./awksolution file1 file2 file3
and, as mentioned, there is no limit to the number of files to average over.
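For example, since the script takes any number of files, you could average a whole set matched by a glob and redirect the result (the file names here are just an illustration):
./awksolution file* > averages.txt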

Related

How to swap the even and odd lines via bash script?

In the incoming string stream from the standard input, swap the even and odd lines.
I've tried to do it like this, but reading from file and $i -lt $a.count aren't working:
$a= gc test.txt
for($i=0;$i -lt $a.count;$i++)
{
if($i%2)
{
$a[$i-1]
}
else
{
$a[$i+1]
}
}
Please help me to get this working.
Suggesting a one-line awk script:
awk '!(NR%2){print$0;print r}NR%2{r=$0}' input.txt
awk script explanation
!(NR % 2) {    # if the row number divides by 2 without remainder (even row)
    print $0;  # print current row
    print r;   # print the previously saved (odd) row
}
(NR % 2) {     # if the row number divided by 2 leaves a remainder (odd row)
    r = $0;    # save current row in variable r
}
There's already a good awk-based answer to the question here, and there are a few decent-looking sed and awk solutions in Swap every 2 lines in a file. However, the shell solution in Swap every 2 lines in a file is almost comically incompetent. The code below is an attempt at a functionally correct pure Bash solution. Bash is very slow, so it is practical to use only on small files (maybe up to 10 thousand lines).
#! /bin/bash -p
idx=0 lines=()
while IFS= read -r 'lines[idx]' || [[ -n ${lines[idx]} ]]; do
    (( idx == 1 )) && printf '%s\n%s\n' "${lines[1]}" "${lines[0]}"
    idx=$(( (idx+1)%2 ))
done
(( idx == 1 )) && printf '%s\n' "${lines[0]}"
The lines array is used to hold two consecutive lines.
IFS= prevents whitespace being stripped as lines are read.
The idx variable cycles through 0, 1, 0, 1, ... (idx=$(( (idx+1)%2 ))) so reading to lines[idx] cycles through putting input lines at indexes 0 and 1 in the lines array.
The read builtin returns a non-zero status immediately if it encounters an unterminated final line in the input. That could cause the loop to terminate before processing the last line, thus losing it in the output. The || [[ -n ${lines[idx]} ]] avoids that by checking if the read actually read some input. (Fortunately, there's no such thing as an unterminated empty line at the end of a file.)
printf is used instead of echo to output the lines because echo doesn't work reliably for arbitrary strings. (For instance, a line containing just -n would get lost completely by echo "$line".) See Why is printf better than echo?.
The question doesn't say what to do if the input has an odd number of lines. This code ((( idx == 1 )) && printf '%s\n' "${lines[0]}") just passes the last line through, after the swapped lines. Other reasonable options would be to drop the last line or print it preceded by an extra blank line. The code can easily be modified to do one of those if desired.
The code is Shellcheck-clean.

How to save in two columns of the same file from different output in bash

I am working on a project that requires me to take some .bed files as input, extract one column from each file, keep only certain values, and count how many of them there are for each file. I am extremely inexperienced with bash, so I don't know most of the commands, but this line of code should do the trick.
for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls
I saved those values in a .xls since I need to do some graphs with them.
Now I would like to take the filenames (with ls) and save them in the first column of my .xls, while the values should go in the 2nd column of the Excel file.
I managed to save everything in one column with the command:
ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls
My sample files are: A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed, and they are contained in a folder with those files only.
The output is the list of all filenames and the values in the same single column, like this:
Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286
but what I need should be something like this:
Filenames     #BS
A549.bed      4536
GM12878.bed   8846
H1.bed        6754
HeLa-S3.bed   14880
HepG2.bed     25440
Ishikawa.bed  14905
K562.bed      22721
MCF-7.bed     8760
SK-N-SH.bed   28286
Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.
One awk solution that keeps track of the number of matching rows and then prints the filename and line count:
awk '
FNR==1 { if ( count >= 1 )                      # first line of a new file? if line counter > 0
             printf "%s\t%d\n", prevFN, count   #   then print previous FILENAME + tab + line count
         count=0                                # then reset our line counter
         prevFN=FILENAME                        # and save the current FILENAME for later printing
       }
$9>1.3 { count++ }                              # if field #9 > 1.3 then increment line counter
END    { if ( count >= 1 )                      # flush last FILENAME/line counter to stdout
             printf "%s\t%d\n", prevFN, count
       }
' *    # * ==> pass all files as input to awk
For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:
bigfile.txt 7
blocker_tree.sql 4
git.bash 2
hist.bash 4
host.bash 2
lines.awk 2
local.sh 3
multi_file.awk 2
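If you also want a header row like the one in the desired output (Filenames and #BS), a BEGIN block could print it before any file is processed; this is a sketch, keeping the tab-delimited format (which spreadsheet programs normally import as two columns):
awk '
BEGIN  { printf "%s\t%s\n", "Filenames", "#BS" }   # header row
FNR==1 { if ( count >= 1 )
             printf "%s\t%d\n", prevFN, count
         count=0
         prevFN=FILENAME
       }
$9>1.3 { count++ }
END    { if ( count >= 1 )
             printf "%s\t%d\n", prevFN, count
       }
' *.bed > /home/parallels/Desktop/EP_Cell_Type.xls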

Filter a very large, numerically sorted CSV file based on a minimum/maximum value using Linux?

I'm trying to output lines of a CSV file which is quite large. In the past I have tried different things and ultimately come to find that Linux's command line interface (sed, awk, grep, etc) is the fastest way to handle these types of files.
I have a CSV file like this:
1,rand1,rand2
4,randx,randy,
6,randz,randq,
...
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu
...
1230994,randm,randn,
1230995,randz,randl,
1231869,rande,randf
Although the first column is numerically increasing, the gap between consecutive numbers varies randomly.
Something like:
sed ./csv -min --col1 1000 -max --col1 1400
which would output all the lines that have a first column value between 1000 and 1400.
The lines are different enough that in a >5 GB file there might only be ~5 duplicates, so it wouldn't be a big deal if it counted the duplicates only once -- but it would be a big deal if it threw an error due to a duplicate line.
I may not know whether particular line values exist (e.g. 1000 is a rough estimate and should not be assumed to exist as a first column value).
Optimizations matter when it comes to large files; the following awk command:
is parameterized (uses variables to define the range boundaries),
performs only a single comparison for records that come before the range, and
exits as soon as the last record of interest has been found.
awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
Because awk performs numerical comparison (with input fields that look like numbers), the range boundaries needn't match field values precisely.
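For instance, against the sample data above (assuming the file really is named ./csv, as in the command), the output should be the four lines whose keys fall inside the range:
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu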
You can easily do this with awk, though it won't take full advantage of the file being sorted:
awk -F , '$1 > 1400 { exit(0); } $1 >= 1000 { print }' file.csv
If you know that the numbers are increasing and unique, you can use addresses like this:
sed '/^1000,/,/^1400,/!d' infile.csv
which deletes (and therefore does not print) every line outside the range that starts at the first line matching /^1000,/ and ends at the next line matching /^1400,/.
Notice that this is only reliable if 1000 and 1400 actually exist as first-column values: if 1000 is missing, nothing is printed at all, and if 1400 is missing, everything from the 1000 line to the end of the file is printed.
In any case, as demonstrated by the answers by mklement0 and that other guy, awk is the better choice here.
Here's a bash-version of the script:
#! /bin/bash
fname="$1"
start_nr="$2"
end_nr="$3"
while IFS=, read -r nr rest || [[ -n $nr && -n $rest ]]; do
    if (( nr < start_nr )); then
        continue
    elif (( nr > end_nr )); then
        break
    fi
    printf "%s,%s\n" "$nr" "$rest"
done < "$fname"
which you would then call as: script.sh foo.csv 1000 2000
The script will start printing when the number is large enough and then immediately stops when the number gets above the limit.

Calling the current and next item in a bash for loop

I'm conducting a reiterative analysis and having to submit more than 5000 jobs to a batch system on a large computer cluster.
I'm wanting to run a bash for loop but call both the current list item and the next item in my script. I'm not sure the best way to do this using this format:
#! /bin/bash
for i in `cat list.txt`;
do
# run a bunch of code on $i (ex. Data file for 'Apples')
# compare input of $i with next in list {$i+1} (ex. Compare 'Apples' to 'Oranges', save output)
# take output of this comparison and use it as an input for the next analysis of $i (ex. analyze 'Apples' some more, save output for the next step, analyze data on 'Oranges')
# save this output as the input for next script which analyses the next item in the list {$i+1} (Analysis of 'Oranges' with input data from 'Apples', and comparing to 'Grapes' in the middle of the loop, etc., etc.)
done
Would it be easiest for me to provide a tabular input list in a while loop? I would really prefer not to do this as I would have to do some code editing, albeit minor.
Thanks for helping a novice -- I've looked all over the interwebs and ran through a bunch of books and haven't found a good way to do this.
EDIT: For some reason I was thinking there might have been a for loop trick to do this but I guess not; it's probably easier for me to do a while loop with a tabular input. I was prepared to do this, but I didn't want to re-write the code I had already.
UPDATE: Thank you all so much for your time and input! Greatly appreciated.
Another solution is to use bash arrays. For example, given a file list.txt with content:
1
2
3
4
4
5
You can create an array variable whose elements are the lines of the file, e.g.:
$ myarray=(1 2 3 4 4 5)
While you could also do myarray=( $(cat list.txt) ), this may split on any whitespace and otherwise mangle the content (globbing, for example); the better method is:
$ IFS=$'\n' read -r -d '' -a myarray < list.txt
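On Bash 4 or newer, mapfile (also known as readarray) is an arguably simpler way to get the same result; this is an aside, not part of the original answer:
$ mapfile -t myarray < list.txt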
Then you can access elements as:
$ echo "${myarray[2]}"
3
The length of the array is given by ${#myarray[@]}. A list of all indices is given by ${!myarray[@]}, and you can loop over this list of indices:
for i in "${!myarray[#]}"; do
echo "${myarray[$i]} ${myarray[$(( $i + 1))]}"
done
Output:
1 2
2 3
3 4
4 4
4 5
5
While there are likely simpler solutions to your particular use case, this would allow you to access arbitrary combinations of array elements in the loop.
This answer assumes that you want your values to overlap -- meaning that a value given as next then becomes curr on the following iteration.
Assuming you encapsulate your code in a function that takes two arguments (current and next) when a next item exists, or one argument when on the last item:
# a "higher-order function"; it takes another function as its argument
# and calls that argument with current/next input pairs.
invokeWithNext() {
    local funcName=$1
    local curr next
    read -r curr
    while read -r next; do
        "$funcName" "$curr" "$next"
        curr=$next
    done
    "$funcName" "$curr"
}
# replace this with your own logic
yourProcess() {
    local curr=$1 next=$2
    if (( $# > 1 )); then
        printf 'Current value is %q, and next item is %q\n' "$curr" "$next"
    else
        printf 'Current value is %q; no next item exists\n' "$curr"
    fi
}
These definitions done, you can run:
invokeWithNext yourProcess <list.txt
...yielding output such as:
Current value is 1, and next item is 2
Current value is 2, and next item is 3
Current value is 3, and next item is 4
Current value is 4, and next item is 5
Current value is 5; no next item exists
$ printf '%d\n' {0..10} | paste - -
0 1
2 3
4 5
6 7
8 9
10
So if you just want to pair up lines so that you can read two variables per line...
while read -r odd even; do
…
done < <(paste - - < inputfile)
You will need to do additional work if your lines contain whitespace.
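For example, since paste joins each pair of lines with a tab, reading with IFS set to a tab keeps ordinary spaces inside the lines intact; a sketch, assuming the input file is list.txt as in the question (it still breaks if the lines themselves contain tabs):
while IFS=$'\t' read -r curr next; do
    printf 'current=%s next=%s\n' "$curr" "$next"
done < <(paste - - < list.txt)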
I would replace the for loop with a while read xx loop.
Something along the lines of
cat list.txt | while read line; do
    if read nextline; then
        : # You have $line and $nextline
    else
        : # You have garbage in $nextline and the last line of list.txt in $line
    fi
done

How can I split a large file with tab-separated values into smaller files while keeping lines together in a single file based on the first value?

I have a file (currently about 1 GB, 40M lines), and I need to split it into smaller files based on a target file size (the target is ~1 MB per file).
The file contains multiple lines of tab-separated values. The first column has an integer value. The file is sorted by the first column. There are about 1M values in the first column, so each value has on average 40 lines, but some may have 2 and others may have 100 or more lines.
12\t...
12\t...
13\t...
14\t...
15\t...
15\t...
15\t...
16\t...
...
2584765\t...
2586225\t...
2586225\t...
After splitting the file, any distinct first value must only appear in a single file. E.g. when I read a smaller file and find a line starting with 15, it is guaranteed that no other files contain lines starting with 15.
This does not mean that all lines starting with a specific value have to go into a dedicated file of their own – a smaller file may contain many different values, as long as no value is spread across two files.
Is this possible with the commandline tools available on a Unix/Linux system?
The following will try to split every 40,000 records, but postpone the split if the next record has the same key as the previous.
awk -F '\t' 'BEGIN { i=1; s=0; f=sprintf("file%05i", i) }
NR % 40000 == 0 { s=1 }
s==1 && $1!=k { close(f); f=sprintf("file%05i", ++i); s=0 }
{ k=$1; print >>f }' input
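To sanity-check the result, you could verify that no key ended up in more than one output file; this is a sketch, assuming the file00001, file00002, ... names produced above (empty output means every key is confined to a single file):
for f in file[0-9]*; do cut -f1 "$f" | sort -u; done | sort | uniq -d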
List all the keys by looking at only the first column (awk) and making them unique (sort -u). Then, for each of these keys, select only the lines that start with the key (grep) and redirect them into a file named after the key.
Oneliner:
for key in `awk '{print $1;}' file_to_split | sort -u` ; do grep -e "^$key\\s" file_to_split > splitted_file_$key ; done
Or multiple lines for a script file and better readability:
for key in `awk '{print $1;}' file_to_split | sort -u`
do
grep -e "^$key\\s" file_to_split > splitted_file_$key
done
Not especially efficient, as it parses the file once per key.
Also not sure whether for can cope with such a large input from the `` subcommand.
On Unix systems you can usually also use Perl, so here is a Perl solution:
#!/usr/local/bin/perl
use strict;

my $last_key;
my $store;
my $c = 0;
my $max_size = 1000000;

while (<>) {
    my @fields = split(/\t/);
    my $key = $fields[0];
    if ($last_key ne $key) {
        store() if (length($store) > $max_size);
    }
    $store .= $_;
    $last_key = $key;
}
store();

sub store {
    $c++;
    open(O, ">", "${c}.part");
    print O $store;
    close O;
    $store = '';
}
Save it as x.pl and use it like:
x.pl bigfile.txt
It splits your entries into
1.part
2.part
...
files and tries to keep them around $max_size.
HTH
