Two file numeric comparison in awk - linux

I'm trying to compare the contents of two files, both of which are just a single column of numbers, i.e.
File1:
1.2
2.6
3.4
4.7
5.3
File2:
5.1
4.8
3.2
2.5
1.6
The output should just be the number of lines in file1 that are greater than the corresponding line in file2; so in this case it'd just be
3

A single awk process can do the job:
awk 'NR==FNR{a[NR]=$0;next}a[FNR]>$0{i++}END{print i+0}' file1 file2
outputs:
3
EDIT
After reading JonathanLeffler's and steveha's comments, I would add another solution that avoids loading a huge file into memory. It is still a single awk process:
awk '{getline x < "file2"}$0>x{i++}END{print i+0}' file1
outputs:
3

Try using paste followed by awk:
paste file1 file2 | awk '$1>$2 {i++} END {print i+0}'
Output:
3

Here is a solution using only AWK, reading only one line at a time from each input file.
BEGIN {
    if (ARGC != 3)
    {
        print "Usage: this_program <file1> <file2>"
        exit(1)
    }
    c = 0
    for (;;)
    {
        result = getline < ARGV[1]
        if (1 != result)
            break
        n1 = $1 + 0
        result = getline < ARGV[2]
        if (1 != result)
            break
        n2 = $1 + 0
        if (n1 > n2)
            ++c;
    }
    print c
}
P.S. I'm a fan of Python and for fun I also solved this in Python.
import sys
if sys.version_info.major < 3:
    import itertools
    zip = itertools.izip
with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    print(sum(float(x) > float(y) for x, y in zip(f1, f2)))
Notes:
zip() pairs values read from two sources. zip(f1, f2) pairs a line read from each of the two input files.
I made this use itertools.izip() when you run it on Python 2.x, so it will handle one line at a time. The built-in zip() in Python 2 reads all the data at once and builds a list.
The error checking isn't obvious but it is there. If an input doesn't work as a float value, you will get an exception; if the user doesn't specify at least two input files, you will get an exception.
This is using a slightly sleazy trick: sum() will treat a Boolean True value as a 1, and a Boolean False value as a 0. Thus this gets a count of all the lines for which the > comparison is true.
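As a worked check of the counting trick, here is a minimal sketch that applies the same zip-and-sum idea to the sample numbers from the question. The function name count_greater and the in-memory lists are assumptions for illustration; the program above reads the values from two files instead.

```python
def count_greater(first, second):
    """Count positions where the value in `first` exceeds the one in `second`."""
    # zip() pairs the inputs and stops at the shorter one. Each comparison
    # yields True or False, and sum() counts True as 1 and False as 0.
    return sum(float(x) > float(y) for x, y in zip(first, second))


# Sample data from the question (hypothetical in-memory stand-ins for the files).
file1 = ["1.2", "2.6", "3.4", "4.7", "5.3"]
file2 = ["5.1", "4.8", "3.2", "2.5", "1.6"]
print(count_greater(file1, file2))  # prints 3
```

Only the last three pairs satisfy the comparison (3.4 > 3.2, 4.7 > 2.5, 5.3 > 1.6), which matches the expected answer.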

Related

separate columns of a text file

Hi experts, I have a big text file that contains many columns. I want to extract each column into a separate text file, in order, with two strings added at the top.
Suppose I have an input file like this:
2 3 4 5 6
3 4 5 6 7
2 3 4 5 6
1 2 2 2 2
Then I need to extract each column into a separate text file with two strings at the top:
file1.txt file2.txt .... filen.txt
s=5 s=5
r=9 r=9
2 3
3 4
2 3
1 2
I tried the script below, but it does not work properly. I need help from the experts. Thanks in advance.
#!/bin/sh
for i in $(seq 1 1 5)
do
echo $i
awk '{print $i}' inp_file > file_$i
done
Could you please try the following, written and tested with the shown samples in GNU awk. It does not call close() on the output files because your sample has only 5 columns in Input_file. It also defines two awk variables (var1 and var2) whose values are printed to each output file before the actual column values.
awk -v var1="s=5" -v var2="r=9" '
{
  count++
  for(i=1;i<=NF;i++){
    outputFile="file"i".txt"
    if(count==1){
      print (var1 ORS var2) > (outputFile)
    }
    print $i > (outputFile)
  }
}
' Input_file
If you can have many more columns, it is better to close each output file right after writing to it with close(), to avoid a "too many open files" error. Use this instead:
awk -v var1="s=5" -v var2="r=9" '
{
  count++
  for(i=1;i<=NF;i++){
    outputFile="file"i".txt"
    if(count==1){
      print (var1 ORS var2) > (outputFile)
    }
    print $i >> (outputFile)
    close(outputFile)
  }
}
' Input_file
Pretty simple to do in one pass through the file with awk using its output redirection:
awk 'NR==1 { for (n = 1; n <= NF; n++) print "s=5\nr=9" > ("file_" n) }
{ for (n = 1; n <= NF; n++) print $n > ("file_" n) }' inp_file
With GNU awk to internally handle more than a dozen or so simultaneously open files:
NR == 1 {
    for (i=1; i<=NF; i++) {
        out[i] = "file" i ".txt"
        print "s=5" ORS "r=9" > out[i]
    }
}
{
    for (i=1; i<=NF; i++) {
        print $i > out[i]
    }
}
or with any awk just close them as you go:
NR == 1 {
    for (i=1; i<=NF; i++) {
        out[i] = "file" i ".txt"
        print "s=5" ORS "r=9" > out[i]
        close(out[i])
    }
}
{
    for (i=1; i<=NF; i++) {
        print $i >> out[i]
        close(out[i])
    }
}
split -nr/$(wc -w <(head -1 input) | cut -d' ' -f1) -t' ' --additional-suffix=".txt" -a4 --numeric-suffix=1 --filter "cat <(echo -e 's=5 r=9') - | tr ' ' '\n' >\$FILE" <(tr -s '\n' ' ' <input) file
This uses the nifty split command in a unique way to rearrange the columns. Hopefully it's faster than awk, although after spending a considerable amount of time coding it, testing it, and writing it up, I find that it may not be scalable enough for you since it requires a process per column, and many systems are limited in user processes (check ulimit -u). I submit it though because it may have some limited learning usefulness, to you or to a reader down the line.
Decoding:
split -- Divide a file up into subfiles. Normally this is by lines or by size but we're tweaking it to use columns.
-nr/$(...) -- Use round-robin output: Sort records (in our case, matrix cells) into the appropriate number of bins in a round-robin fashion. This is the key to making this work. The part in parens means, count (wc) the number of words (-w) in the first line (<(head -1 input)) of the input and discard the filename (cut -d' ' -f1), and insert the output into the command line.
-t' ' -- Use a single space as a record delimiter. This breaks the matrix cells into records for split to split on.
--additional-suffix=".txt" -- Append .txt to output files.
-a4 -- Use four-digit numbers; you probably won't get 1,000 files out of it but just in case ...
--numeric-suffix=1 -- Add a numeric suffix (normally it's a letter combination) and start at 1. This is pretty pedantic but it matches the example. With the default two-character suffix you would run out of names at around 100 columns, which is why -a4 is given above; raise it if you need even more.
--filter ... -- Pipe each file through a shell command.
Shell command:
cat -- Concatenate the next two arguments.
<(echo -e 's=5 r=9') -- This means execute the echo command and use its output as the input to cat. We use a space instead of a newline to separate because we're converting spaces to newlines eventually and it is shorter and clearer to read.
- -- Read standard input as an argument to cat -- this is the binned data.
| tr ' ' '\n' -- Convert spaces between records to newlines, per the desired output example.
>\$FILE -- Write to the output file, which is stored in $FILE (but we have to quote it so the shell doesn't interpret it in the initial command).
Shell command over -- rest of split arguments:
<(tr -s '\n' ' ' < input) -- Use, as input to split, the example input file but convert newlines to spaces because we don't need them and we need a consistent record separator. The -s means only output one space between each record (just in case we got multiple ones on input).
file -- This is the prefix to the output filenames. The output in my example would be file0001.txt, file0002.txt, ..., file0005.txt.
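The round-robin idea behind -nr/N can be sketched in a few lines of Python. This is only an illustration of the binning, not a replacement for the split command: the function name split_columns is hypothetical, the input is assumed to be a non-empty, whitespace-separated matrix with the same number of cells in every row, and the columns are returned as lists rather than written to files.

```python
def split_columns(text, header=("s=5", "r=9")):
    """Return one list per column, each starting with the header strings."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    ncols = len(rows[0])  # assumes at least one row, all rows equally wide
    # Flatten the matrix into one stream of cells, row by row.
    cells = [cell for row in rows for cell in row]
    # Cell number k lands in bin k % ncols: that is the round-robin distribution.
    return [list(header) + cells[col::ncols] for col in range(ncols)]


sample = "2 3 4 5 6\n3 4 5 6 7\n2 3 4 5 6\n1 2 2 2 2\n"
for number, column in enumerate(split_columns(sample), start=1):
    print("file%d.txt:" % number, " ".join(column))
```

Because the cells are dealt out in turn, bin 1 receives the 1st, 6th, 11th and 16th cells, which are exactly the entries of the first column.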

AWK print every other column, starting from the last column (and next to last column) for N iterations (print from right to left)

Hopefully someone out there in the world can help me, and anyone else with a similar problem, find a simple solution to capturing data. I have spent hours trying a one liner to solve something I thought was a simple problem involving awk, a csv file, and saving the output as a bash variable. In short here's the nut...
The Missions:
1) To output every other column, starting from the LAST COLUMN, with a specific iteration count.
2) To output every other column, starting from NEXT TO LAST COLUMN, with a specific iteration count.
The Data (file.csv):
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
Desired results for Mission 1:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Desired results for Mission 2:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
My Attempts:
The closest I have come to solving any of the above problems is an ugly pipe (which is OK for skinning a cat) for Mission 1. However, it doesn't use any declared iteration count (which should be 5). Also, I'm completely lost on solving Mission 2.
Any help to simplify the below and solving Mission 2 will be HELLA appreciated!
outcome=$( awk 'BEGIN {FS = "#"} {for (i = 0; i <= NF; i += 2) printf ("%s%c", $(NF-i), i + 2 <= NF ? "#" : "\n");}' file.csv | sed 's/##.*//g' | awk -F# '{for (i=NF;i>0;i--){printf $i"#"};printf "\n"}' | sed 's/#$//g' | awk -F# '{$1="";print $0}' OFS=# | sed 's/^#//g' );
Also, if looping for a specific number of iterations is helpful in solving this problem, then the magic number is 5. Maybe a solution could be a for-loop that counts from right to left, treating each jump over a column as 1 iteration, with the starting column declared as an awk variable. (Just a thought; I don't know how to do it.)
Thank you for looking over this problem.
There are certainly more elegant ways to do this, but I am not really an awk person:
Part 1:
awk -F# '{ x = ""; for (f = NF; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
Part 2:
awk -F# '{ x = ""; for (f = NF - 1; f > (NF - 5 * 2); f -= 2) { x = x ? $f "#" x : $f ; } print x }' file.csv
Output:
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
The literal 5 in each of those is your "number of iterations."
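For comparison, the same right-to-left selection can be sketched in Python. The function name every_other_from_right and its parameters are hypothetical; offset=0 starts at the last field (Mission 1) and offset=1 at the next-to-last field (Mission 2).

```python
def every_other_from_right(line, count, offset=0, sep="#"):
    """Pick `count` fields, stepping left by two, and return them left to right."""
    fields = line.split(sep)
    start = len(fields) - 1 - offset      # index of the rightmost wanted field
    picked = fields[start::-2][:count]    # walk right to left, every other field
    return sep.join(reversed(picked))     # restore the original left-to-right order


row = "#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25"
print(every_other_from_right(row, 5))            # Mission 1
print(every_other_from_right(row, 5, offset=1))  # Mission 2
```

For the first sample row this yields 2#2.25#1.5#1#3.25 and SayWhat#4#3#1#1, matching the desired results in the question.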
Sample data:
$ cat mission.dat
#12#SayWhat#2#4#2.25#3#1.5#1#1#1#3.25
#7#Smarty#9#6#5.25#5#4#4#3#2#3.25
#4#IfYouLike#4#1#.2#1#.5#2#1#3#3.75
#3#LaughingHard#8#8#13.75#8#13#6#8.5#4#6
#10#AtFunny#1#3#.2#2#.5#3#3#5#6.5
#8#PunchLines#7#7#10.25#7#10.5#8#11#6#12.75
One awk solution:
NOTE: OP can add logic to validate the input parameters.
$ cat mission
#!/bin/bash
# format: mission { 1 | 2 } { number_of_fields_to_display }
mission=${1}                  # assumes user inputs "1" or "2"
offset=$(( mission - 1 ))     # subtract one to determine awk/NF offset
iteration_count=${2}          # assume for now this is a positive integer
awk -F"#" -v offset="${offset}" -v itcnt="${iteration_count}" 'BEGIN { OFS=FS }
{   # we will start by counting fields backwards until we run out of fields
    # or we hit "itcnt==iteration_count" fields
    loopcnt=0
    for (i=NF-offset ; i>=1; i-=2)    # offset=0 for mission=1; offset=1 for mission=2
        { loopcnt++
          if (loopcnt > itcnt)
             break
          fstart=i                    # keep track of the field we want to start with
        }
    # now printing our fields starting with field # "fstart";
    # prefix the first printf with an empty string, then each successive
    # field is prefixed with OFS=#
    pfx = ""
    for (i=fstart; i<= NF-offset; i+=2)
        { printf "%s%s",pfx,$i
          pfx=OFS
        }
    # terminate a line of output with a linefeed
    printf "\n"
}
' mission.dat
Some test runs:
###### mission #1
# with offset/iteration = 4
$ mission 1 4
2.25#1.5#1#3.25
5.25#4#3#3.25
.2#.5#1#3.75
13.75#13#8.5#6
.2#.5#3#6.5
10.25#10.5#11#12.75
#with offset/iteration = 5
$ mission 1 5
2#2.25#1.5#1#3.25
9#5.25#4#3#3.25
4#.2#.5#1#3.75
8#13.75#13#8.5#6
1#.2#.5#3#6.5
7#10.25#10.5#11#12.75
# with offset/iteration = 6
$ mission 1 6
12#2#2.25#1.5#1#3.25
7#9#5.25#4#3#3.25
4#4#.2#.5#1#3.75
3#8#13.75#13#8.5#6
10#1#.2#.5#3#6.5
8#7#10.25#10.5#11#12.75
###### mission #2
# with offset/iteration = 4
$ mission 2 4
4#3#1#1
6#5#4#2
1#1#2#3
8#8#6#4
3#2#3#5
7#7#8#6
# with offset/iteration = 5
$ mission 2 5
SayWhat#4#3#1#1
Smarty#6#5#4#2
IfYouLike#1#1#2#3
LaughingHard#8#8#6#4
AtFunny#3#2#3#5
PunchLines#7#7#8#6
# with offset/iteration = 6;
# notice we pick up field #1 = empty string so output starts with a '#'
$ mission 2 6
#SayWhat#4#3#1#1
#Smarty#6#5#4#2
#IfYouLike#1#1#2#3
#LaughingHard#8#8#6#4
#AtFunny#3#2#3#5
#PunchLines#7#7#8#6
This is probably not what you're asking, but perhaps it will give you an idea.
$ awk -F_ -v skip=4 -v endoff=0 '
BEGIN {OFS=FS}
{offset=(NF-endoff)%skip;
for(i=offset;i<=NF-endoff;i+=skip) printf "%s",$i (i>=(NF-endoff)?ORS:OFS)}' file
112_116_120
122_126_130
132_136_140
142_146_150
You specify the step between printed columns (skip) and the end offset (endoff) as input variables. Here, to finish on the last column, the end offset is set to zero, and skip is 4.
For clarity I used the input file
$ cat file
_111_112_113_114_115_116_117_118_119_120
_121_122_123_124_125_126_127_128_129_130
_131_132_133_134_135_136_137_138_139_140
_141_142_143_144_145_146_147_148_149_150
Changing FS for your format should work.

Emacs: how to concatenate two rows together to form unique identifier? [duplicate]

Input where identifier specified by two rows 1-2
L1_I L1_I C-14 <---| unique identifier
WWPTH WWPT WWPTH <---| on two rows
1 2 3
Goal: how to concatenate the rows?
L1_IWWPTH L1_IWWPT C-14WWPTH <--- unique identifier
1 2 3
P.s. I will accept the simplest and most elegant solution.
Assuming that the input is in a file called file:
$ awk 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next} 1' file
L1_IWWPTH L1_IWWPT C-14WWPTH
1 2 3
How it works
NR==1{for (i=1;i<=NF;i++) a[i]=$i;next}
For the first line, save all the column headings in the array a. Then, skip over the rest of the commands and jump to the next line.
NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next}
For the second line, print all the column headings, merging together the ones from the first and second rows. Then, skip over the rest of the commands and jump to the next line.
1
1 is awk's cryptic shorthand for print the line as is. This is done for all lines after the second.
Tab-separated columns with possible missing columns
If columns are tab-separated:
awk -F'\t' 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%s\t",a[i] $i;print"";next} 1' file
If you plan to use python, you can use zip in the following way:
input = [['L1_I', 'L1_I', 'C-14'], ['WWPTH','WWPT','WWPTH'],[1,2,3]]
output = [[i+j for i,j in zip(input[0],input[1])]] + input[2:]
print(output)
output:
[['L1_IWWPTH', 'L1_IWWPT', 'C-14WWPTH'], [1, 2, 3]]
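#!/usr/bin/awk -f
NR == 1 {
    n = split($0, a)    # n = number of header columns
    next
}
NR == 2 {
    # iterate by index: "for (b in a)" visits keys in an unspecified order
    for (b = 1; b <= n; b++)
        printf "%-20s", a[b] $b
    print ""
    next
}
1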

Implement tail with awk

I am struggling with this awk code which should emulate the tail command
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
So what I'm trying to achieve here is a tail command emulated by awk.
For example consider cat somefile | awk -f tail.awk 10
should print the last 10 lines of a text file, any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be outputted (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
more legibly (and with less code golf):
awk -v tail=10 '
{
    output[NR % tail] = $0
}
END {
    if(NR < tail) {
        i = 0
    } else {
        i = NR
    }
    do {
        i = (i + 1) % tail;
        print output[i]
    } while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
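The modulo trick described above can be sketched in Python as a small ring buffer. The function name tail_lines is hypothetical, and unlike the awk version it returns a list instead of printing; in real Python code collections.deque(maxlen=n) would do the same job.

```python
def tail_lines(lines, n):
    """Return the last n items of an iterable while holding at most n in memory."""
    if n <= 0:
        return []
    ring = [None] * n
    count = 0
    for line in lines:
        ring[count % n] = line   # overwrite the oldest slot
        count += 1
    if count < n:                # fewer lines than requested: no wrap-around yet
        return ring[:count]
    start = count % n            # slot holding the oldest surviving line
    return ring[start:] + ring[:start]


print(tail_lines((str(i) for i in range(1, 16)), 10))
```

With 15 input lines and n=10, slots 0 to 4 hold lines 11 to 15 and slots 5 to 9 hold lines 6 to 10, so reading from slot 5 onward gives the last ten lines in order.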
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $ is the field operator: $i means field number i of the current record, not the variable i. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN{
    num=ARGV[1];
    # Make that arg empty so awk doesn't interpret it as a file name.
    ARGV[1] = "";
}
{
    vect[NR]=$0;
}
END{
    # Start at NR-num+1 so that exactly num lines are printed.
    for(i=NR-num+1;i<=NR;i++)
        print vect[i]
}
You should probably add some code to the END to handle the case when NR < num.
You need to add -v num=10 to the awk commandline to set the value of num. And start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output.
This might work for you:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10

Randomly Pick Lines From a File Without Slurping It With Unix

I have a 10^7 lines file, in which I want to choose 1/100 of lines randomly
from the file. This is the AWK code I have, but it slurps all the file content
before hand. My PC memory cannot handle such slurps. Is there other approach to do it?
awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {
for ( i=1;i<=c ;i++ ) {
num=int(rand() * c)
if ( a[num] ) {
print a[num]
delete a[num]
d++
}
if ( d == c/100 ) break
}
}' file
If you have that many lines, are you sure you need exactly 1%, or would a statistical estimate be enough?
In the second case, just select each line with probability 1%:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
If you'd like the header line plus a random sample of lines after, use:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
You used awk, but I don't know if it's required. If it's not, here's a trivial way to do w/ perl (and without loading the entire file into memory):
cat your_file.txt | perl -n -e 'print if (rand() < .01)'
(simpler form, from comments):
perl -ne 'print if (rand() < .01)' your_file.txt
I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.
This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)
Original thread: Submitted for your review -- random sampling in awk.
# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2
BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()
}
NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next
}
NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }
}
END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]
}
function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR
}
This should work on most any GNU/Linux machine.
$ shuf -n "$(( $(wc -l < "$file") / 100 ))" "$file"
I'd be surprised if memory management was done inappropriately by the GNU shuf command.
I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the for line in file return line if rand < 0.01 approach, so it might be useful if you intend to do tasks like the above many (thousands, millions) of times. It is known as reservoir sampling and this page has a pretty good explanation of a version of it that is applicable to your situation.
The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)
A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.
In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk program for sampling:
#!/usr/bin/awk -f
# (on Linux, "#!/usr/bin/env awk -f" fails because env receives "awk -f" as one argument)
BEGIN{
    srand();
    if(k=="") k=10
}
NR <= k {
    reservoir[NR-1] = $0;
    next;
}
{ i = int(NR * rand()) }
i < k { reservoir[i] = $0 }
END {
    for (i in reservoir) {
        print reservoir[i];
    }
}
If saved as sample_lines and made executable, it can be run like: ./sample_lines -v k=5 input_file. If k is not given, then 10 will be used by default.
Then figuring out what k is has to be done separately, for example by setting -v "k=$(dc -e "$(cat input_file | wc -l) 100 / n")"
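For readers who find the awk terse, here is a minimal Python sketch of the same reservoir idea. The function name reservoir_sample and the rng parameter are assumptions added for illustration; the logic follows the awk program above: keep the first k lines, then let line number NR replace a random slot with probability k/NR.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for seen, line in enumerate(lines, start=1):
        if seen <= k:
            reservoir.append(line)            # fill the reservoir first
        else:
            slot = int(seen * rng.random())   # uniform integer in [0, seen)
            if slot < k:                      # true with probability k / seen
                reservoir[slot] = line
    return reservoir


# Hypothetical usage: sample 5 values from a stream of 1000 numbered lines.
print(reservoir_sample((str(i) for i in range(1000)), 5))
```

Only k lines are ever held in memory, and the input is read once. If the input has fewer than k lines, all of them are returned.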
You could do it in two passes:
1. Run through the file once, just to count how many lines there are
2. Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
3. Run through the file once more and pick out the lines at the selected positions
Example in python:
fn = '/usr/share/dict/words'
from random import randint
from sys import stdout
count = 0
with open(fn) as f:
    for line in f:
        count += 1
selected = set()
while len(selected) < count//100:
    selected.add(randint(0, count-1))
index = 0
with open(fn) as f:
    for line in f:
        if index in selected:
            stdout.write(line)
        index += 1
Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside the "!/^$/" block. That way, you only hold 100 lines at a time.
If the aim is just to avoid memory exhaustion, and the file is a regular file, no need to implement reservoir sampling. The number of lines in the file can be known if you do two passes in the file, one to get the number of lines (like with wc -l), one to select the sample:
file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
BEGIN {srand(); p = int(n * percent)}
rand() * n-- < p {p--; print}' < "$file"
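The awk one-liner above is compact, so here is a hedged Python sketch of the same selection-sampling idea. The function name selection_sample and its parameters are hypothetical; n must be the true number of items (the wc -l result in the shell version) and p the number wanted.

```python
import random


def selection_sample(items, n, p, rng=random):
    """Pick exactly p of the n items in one pass, preserving input order."""
    chosen = []
    remaining = n   # items not yet examined, including the current one
    needed = p      # items still to be selected
    for item in items:
        # Select with probability needed / remaining. Once needed equals
        # remaining every item is taken, and once needed is 0 none are.
        if rng.random() * remaining < needed:
            chosen.append(item)
            needed -= 1
        remaining -= 1
    return chosen


# Hypothetical usage: 1% of 1000 numbered lines.
print(selection_sample(range(1000), 1000, 10))
```

Unlike reservoir sampling, this needs the line count up front, but it stores nothing except the selected lines and keeps them in file order.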
Here's my version. In the below 'c' is the number of lines to select from the input. Making c a parameter is left as an exercise for the reader, as is the reason the line starting with c/NR works to reliably select exactly c lines (assuming input has at least c lines).
#!/bin/sh
gawk '
  BEGIN { srand(); c = 5 }
  c/NR >= rand() { lines[x++ % c] = $0 }
  END { for (i in lines) print lines[i] }
' "$@"
