Write the contents of the variable to a file - linux

How can I save the contents of the variable sum in this operation?
$ seq 1 5 | awk '{sum+=$1} end {print sum; echo "$sum" > test_file}'

It looks like you're confusing BASH syntax and Awk. Awk is a programming language, and it has very different syntax from BASH.
$ seq 1 5 | awk '{ sum += $1 } END { print sum }'
15
You want to capture that 15 into a file:
$ seq 1 5 | awk '{ sum += $1 } END { print sum }' > test_file
That is using the shell's redirection. The > appears outside of the Awk program where the shell has control, and redirects standard out into the file test_file.
You can also redirect from inside Awk, using Awk's own redirection, which happens to use the same syntax as BASH:
$ seq 1 5 | awk '{ sum += $1 } END { print sum > "test_file" }'
Note that the file name has to be quoted, or Awk will assume that test_file is a variable, and you'll get some error about redirecting to a null file name.
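For instance (the exact error text varies between awk implementations, but it will be something about a null or empty file name, as described above):
# quoted: 15 ends up in test_file
$ seq 1 5 | awk '{ sum += $1 } END { print sum > "test_file" }'
# unquoted: awk treats test_file as an (unset) variable, so the redirection
# target is an empty string and awk aborts with an error
$ seq 1 5 | awk '{ sum += $1 } END { print sum > test_file }'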

To write your output into a file, you have to redirect to "test_file" like this:
$ seq 5 | awk '{sum+=$1} END{print sum > "test_file"}'
$ cat test_file
15
Your version was not working because you were not quoting test_file, so awk treated it as a variable; and since you had not defined it beforehand, awk couldn't redirect properly. David W's answer explains it pretty well.
Note also that seq 5 is equivalent to seq 1 5.
In case you want to save the result into a variable, you can use the var=$(command) syntax:
$ sum=$(seq 5 | awk '{sum+=$1} END{print sum}')
$ echo $sum
15

echo won't work in the awk command. Try this:
seq 1 5 | awk '{sum+=$1} END {print sum > "test_file"}'

You don't need awk for this. You can say:
$ seq 5 | paste -sd+ | bc > test_file
$ cat test_file
15

This question is tagged with bash so here is a pure bash solution:
for ((i=1; i<=5; i++)); do ((sum+=i)); done; echo "$sum" > 'test_file'
Or this one:
for i in {1..5}; do ((sum+=i)); done; echo "$sum" > 'test_file'
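As an aside (not something the answers above mention), for this particular input you don't even need a loop: the closed form n(n+1)/2 works directly in bash arithmetic:
n=5; echo "$(( n * (n + 1) / 2 ))" > 'test_file'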

http://sed.sourceforge.net/grabbag/scripts/add_decs.sed
#! /bin/sed -f
# This is an alternative approach to summing numbers,
# which works a digit at a time and hence has unlimited
# precision. This time it is done with lookup tables,
# and uses only 10 commands.
G
s/\n/-/
s/$/-/
s/$/;9aaaaaaaaa98aaaaaaaa87aaaaaaa76aaaaaa65aaaaa54aaaa43aaa32aa21a100/
:loop
/^--[^a]/!{
# Convert next digit from both terms into analog form
# and put the two groups next to each other
s/^\([0-9a]*\)\([0-9]\)-\([^-]*\)-\(.*;.*\2\(a*\)\2.*\)/\1-\3-\5\4/
s/^\([^-]*\)-\([0-9a]*\)\([0-9]\)-\(.*;.*\3\(a*\)\3.*\)/\1-\2-\5\4/
# Back to decimal, but keeping the carry in analog form
# \2 matches an `a' if there are at least ten a's, else nothing
#
# 1------------- 3- 4----------------------
# 2 5----
s/-\(aaaaaaaaa\(a\)\)\{0,1\}\(a*\)\([0-9b]*;.*\([0-9]\)\3\5\)/-\2\5\4/
b loop
}
s/^--\([^;]*\);.*/\1/
h
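If I read the script right, it keeps a running total in hold space and prints it after every input line, so for a one-shot sum you would take the last line of output; a usage sketch, assuming the script is saved as add_decs.sed:
$ seq 1 5 | sed -f add_decs.sed | tail -n 1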

Related

How should I count the duplicate lines in each file?

I have tried this :
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$#")
for dir in "${dirs[#]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
else { @buff = $_ }
$n = $F[0];
END { print for @buff }'
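If you'd rather not switch to Perl, an awk sketch with the same idea (keep every line whose count equals the running maximum) can replace the sort -n | tail -n1 stage:
... | sort | uniq -c | awk '$1 > max  { max = $1; buf = $0; next }
                            $1 == max { buf = buf ORS $0 }
                            END       { print buf }'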
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$#"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
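If you want one count per input file from a single invocation instead of the shell loop above, GNU awk's ENDFILE block can do it; this is gawk-specific, so treat it as a sketch rather than portable awk:
$ gawk '$0 in a {c++} {a[$0]} ENDFILE {printf "%s: %d\n", FILENAME, c; c=0; delete a}' inp1.txt inp2.txt
inp1.txt: 1
inp2.txt: 0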

How to efficiently get 10% of random lines out of the large file in Linux?

I want to output a random 10% of the total lines of a file. For instance, if file a has 1,000,000 lines, then I want to output 100,000 random lines from it (100,000 being 10% of 1,000,000).
There is an easy way to do this, supposing the file is small:
randomLine=`wc -l a | awk '{printf("%d\n",($1/10))}'`
sort -R a | head -n $randomLine
But using sort -R is very slow: it performs a full random shuffle. My file has 10,000,000 lines, and sorting takes too much time. Is there any way to achieve a less thorough, not-quite-random but efficient sampling?
Edit Ideas:
Sampling one line out of every ten would be acceptable, but I don't know how to do this with a shell script.
Read line by line, and if
echo $RANDOM%100 | bc
is greater than 20, then output the line (using a threshold higher than 10 to make sure I get no less than 10% of the lines), and stop once 10% of the lines have been output. But I don't know how to read line by line using a shell script.
Edit Description
The reason I want to use a shell script is that my file contains \r characters. The newline character in the file should be \n, but the readline() function in Python and Java treats both \r and \n as newline characters, which doesn't fit my need.
Let's create a random list of X numbers from 1 to Y. You can do it with:
shuf -i 1-Y -nX
In your case,
shuf -i 1-1000000 -n100000
Then you pass those line numbers to awk (here via process substitution), so that it prints only those lines:
awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-1000000 -n100000) file
Explanation
FNR==NR {a[$1]; next} loops through the shuf results and stores them in the a[] array.
{if (FNR in a) print} if the line number of the second parameter (the file) is found in the array a[], print it.
Sample with Y=10, X=2
$ cat a
1 hello
2 i am
3 fe
4 do
5 rqui
6 and
7 this
8 is
9 sample
10 text
$ awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-10 -n2) a
2 i am
9 sample
$ awk 'FNR==NR {a[$1]; next} {if (FNR in a) print}' <(shuf -i 1-10 -n2) a
4 do
6 and
Improvement
As plundra suggested in comments:
shuf -n $(( $(wc -l < $FILENAME) / 10 )) $FILENAME
I think this is the best way:
file=your file here
lines_in_file=`wc -l < $file`
lines_wanted=$(($lines_in_file/10))
shuf -n $lines_wanted $file
Another creative solution:
echo $RANDOM generates a random number between 0 and 32767
Then, you can do:
echo $(($RANDOM*100000/32767+1))
.. to obtain a random number between 1 and 100000 (as nwellnhof points out in comments below, it's not any number from 1 to 100000, but one of 32768 possible numbers between 1 and 100000, so it's kind of a projection...)
So:
file=your file here
lines_in_file=`wc -l $file | awk '{print $1}'`
lines_wanted=$(($lines_in_file/10))
for i in `seq 1 $lines_wanted`
do line_chosen=$(($RANDOM*${lines_in_file}/32767+1))
sed "${line_chosen}q;d" $file
done
I have this script that will give you roughly 1/x of the lines.
#!/usr/bin/perl -w
use strict;
my $ratio = shift;
while (<>) {
print if ((rand) <= 1 / $ratio);
}
This gives roughly 1/$ratio of the lines for a large enough input, assuming a uniform distribution of rand's outputs.
Assuming you call this random_select_ratio.pl, run it like this to get 10% of the lines:
random_select_ratio.pl 10 my_file
or
cat my_file | random_select_ratio.pl 10
Just run this awk script with the file as input.
BEGIN { srand() }{ if (rand() < 0.10) print $0; }
It's been a while since I used awk, but I do believe that should do it.
And in fact it does work exactly as expected. Approximately 10% of the lines are output. On my Windows machine using GNU awk, I ran:
awk "BEGIN { srand() }{ if (rand() < 0.10) print $0; }" <numbers.txt >nums.txt
numbers.txt contained the numbers 1 through 1,000,000, one per line. Over multiple runs, the file nums.txt typically contained about 100,200 items, which works out to 10.02%.
If there's a problem with what awk considers a line, you can always change the record separator, that is, RS = "\n"; but that should already be the default on a Linux machine.
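For example, since the question worries about stray \r characters: with the default RS = "\n", any \r inside a line simply stays in the output, and if the input were actually CRLF-terminated, GNU awk accepts a multi-character RS (POSIX awk only uses the first character). A sketch, with placeholder file names:
# default \n records; embedded \r bytes pass through untouched
awk 'BEGIN { srand(); RS = "\n" } rand() < 0.10' infile > sample
# CRLF-terminated input (gawk-specific multi-character RS)
gawk 'BEGIN { srand(); RS = "\r\n"; ORS = "\r\n" } rand() < 0.10' infile > sample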
Here's one way to do Edit idea 1. in bash:
while readarray -n10 a; do
[ ${#a[@]} = 0 ] && break
printf "%s" "${a[${RANDOM: -1:1}]}"
done < largefile.txt
Kinda slow, though it was about 2.5x faster than the sort -R method on my machine.
We use readarray to read from the input stream 10 lines at a time into an array. Then we use the last digit of $RANDOM as an index into that array and print the resulting line.
Using the readarray/printf combo should ensure the \r characters are passed through unmodified, as in the edited requirement.
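For completeness, the other idea from the question edit (read line by line and keep each line with roughly 10% probability) also works in pure bash; a minimal sketch that, like the readarray version, passes \r bytes through unmodified (it will be slow on a 10-million-line file, though):
while IFS= read -r line; do
    (( RANDOM % 10 == 0 )) && printf '%s\n' "$line"
done < largefile.txt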

bash print first to nth column in a line iteratively

I am trying to get the column names of a file and print them iteratively. I guess the problem is with the print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
set snp = head -n 1 smaller.txt | awk '{print $i}'
echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name
Age
Sex
State
Ext
But the output I get is a blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[#]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[@]:1}"; do
echo "$snp"
done
Edit 2. Your original method had many many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
echo "$snp"
done
set probably doesn't do what you think it does.
Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash.
This algorithm is not good since you're calling head and awk 4 times, whereas you don't need a single external process.
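If you do keep the loop, a cleaner way to hand the shell counter to awk than escaping $ inside double quotes is awk's -v option; a sketch of the same loop:
for i in {2..5}; do
    snp=$(head -n 1 smaller.txt | awk -v col="$i" '{ print $col }')
    echo "$snp"
done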
Hope this helps!
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk '{print $i}')
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others:
for n in $(head smaller.txt)
do
echo ${n}
done
Something like:
for x1 in $(head -n1 smaller.txt); do
echo $x1
done

wput speed result as pass or fail

I'm using the following to output the result of an upload speed test
wput 10MB.zip ftp://user:pass@host 2>&1 | grep '\([0-9.]\+[KM]/s\)'
which returns
18:14:38 (10MB.zip) - '10.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 10.23M/s
I'd like to have the result 10.23M/s (i.e. the speed) echoed, and a comparison result:
if speed >= 5 MB/s then echo "pass" else echo "fail"
So, the final output would be:
PASS 7 M/s
23/01/2013
Ideally I'd like it all done on a single line. So far I've got
wput 100M.bin ftp://test:test@0.0.0.0 2>&1 | grep -o '\([0-9.]\+[KM]/s\)$' | awk ' { if (($1 > 5) && ($2 == "M/s")) { printf("FAST %s\n ", $0); }}'
however it doesn't output anything if I remove
&& ($2 == "M/s"))
it works, but then it would still echo FAST even if the speed were only just over 1 K/s, whereas I obviously want output only above 5 M/s. Can someone tell me what I've missed?
Using awk:
# Over 5M/s
$ cat pass
18:14:38 (10MB.zip) - '10.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 10.23M/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' pass
PASS 10.23M/s
# Under 5M/s
$ cat fail
18:14:38 (10MB.zip) - '3.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 3.23M/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' fail
FAIL 3.23M/s
# Also Handle K/s
$ cat slow
18:14:38 (10MB.zip) - '3.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 8.23K/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' slow
FAIL 8.23K/s
Not sure where you get 7 M/s from?
According to @Rubens, you can use grep -o with your regex to show the speed; just append $ for end of line:
wput 10MB.zip ftp://user:pass@host 2>&1 | grep -o '\([0-9.]\+[KM]/s\)$'
With perl you can easily do the remaining stuff
use strict;
use warnings;
while (<>) {
if (m!\s+((\d+\.\d+)([KM])/s)$!) {
if ($2 > 5 && $3 eq 'M') {
print "PASS $1\n";
} else {
print "FAIL $1\n";
}
}
}
and then call it
wput 10MB.zip ftp://user:pass#host 2>&1 | perl script.pl
This is an answer to the question update.
In your awk program, you haven't split the speed into a numeric value and a unit; it is still one string.
Because a fast speed is greater than 5 M/s, you can ignore K/s entirely and extract the number by splitting at the character M. Then you have the numeric speed in $1 and can compare it:
wput 100M.bin ftp://test:test@0.0.0.0 2>&1 | grep -o '[0-9.]\+M/s$' | awk -F 'M' '{ if ($1 > 5) { printf("FAST %s\n", $0); }}'

Simple aggregation using linux scripting

Let's say I have a text file with lines like these:
foo 10
bar 15
bar 5
foo 30
...
What's the simplest way to generate the following output:
foo 40
bar 20
?
This will do:
awk '{arr[$1]+=$2;} END { for (i in arr) print i, arr[i]}' file
For more information, read on Awk's associative arrays.
Use this awk script:
awk '{sums[$1] += $2} END {for (a in sums) print a, sums[a]}' infile
OUTPUT:
foo 40
bar 20
See an awk tutorial on associative arrays for more details on this idiom.
If you are interested in perl:
perl -F -lane '$X{$F[0]}=$X{$F[0]}+$F[1];if(eof){foreach (keys %X){print $_." ".$X{$_}}}' your_file
Here's one way with sort, GNU sed and bc:
sort infile |
sed -r ':a; N; s/([^ ]+) +([^\n]+)\n\1/\1 \2 +/; ta; P; D' |
sed -r 'h; s/[^ ]+/echo/; s/$/ | bc/e; G; s/([^\n]+)\n([^ ]+).*/\2 \1/'
Output:
bar 20
foo 40
The first sed joins adjacent lines that share the same key, inserting a + between the numbers; the second passes the resulting sums to bc.
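To see what the first sed hands to the second one, run the pipeline without its last stage; on the sample data above (GNU sed) the intermediate stream should look like this:
$ sort infile | sed -r ':a; N; s/([^ ]+) +([^\n]+)\n\1/\1 \2 +/; ta; P; D'
bar 15 + 5
foo 10 + 30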
