Let's assume I have scientific data: numbers arranged in a single column, but representing an intensity matrix of n (width) by m (height). The column in the input file has n * m rows in total. An input example may look like this:
1
2
3
......
30
The output should have n columns with m rows each. Sticking to my example with 30 input values and n = 3, m = 10, I would need an output file like this (the separator does not matter much: a blank, a tab, etc.):
1 11 21
2 12 22
... ... ...
10 20 30
I use gawk under Windows. Please note that there is no special FS; real-world examples are more like 60 * 60 or bigger.
If you are not limited to awk but have GNU coreutils (Cygwin, native, ...), then the simplest solution is to use pr:
pr -ts" " --columns 3 file
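With --columns, pr fills down the first column before starting the next, which matches the layout asked for. For the 30-value example it should produce something like this (exact spacing may vary):
1 11 21
2 12 22
...
10 20 30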
I believe this will do:
awk '
{ split($0,data); }
END {
    m = 10;
    n = 3;
    for( i = 1; i<=m; i++ ) {
        for( j = 0; j<n; j++ ) {
            printf "%s ", data[j*m + i]   # output data plus space in one line
        }
        # here you might want to start a new line though you did not ask for it:
        printf "\n";
    }
}' inputfile
The trick is the split in the first line: it splits your input on whitespace and creates the array data. The END block runs after processing your file and just accesses data by index. Note that split() populates the array starting at index 1, which is why data[j*m + i] with i starting at 1 picks out the value for output row i, column j.
The assumption is that all data is on a single line; your question isn't quite clear on this. If it is spread over several lines, you'd have to read it into the array differently.
Hope this gets you started.
EDIT
I notice you changed your question while I was answering it. So change
{ split($0,data); }
to
{ data[++i] = $1; }
to account for the input being on separate lines. Actually, this would give you the option to read it into a two-dimensional array in the first place.
EDIT 2
Read into a two-dimensional array
To read into a two-dimensional array, assuming m and n are known beforehand and not encoded in the input somehow:
awk '
BEGIN {
    m = 10;
    n = 3;
}
{
    i = (NR-1) % m;        # row index, 0..m-1
    j = int((NR-1) / m);   # column index, 0..n-1
    data[i,j] = $1;
}
END {
    # do something with data
}' inputfile
However, since you only want to reformat your data, you can combine the two solutions and pass m and n on the command line (the data array is still needed, because the first output row uses values from all over the input):
awk -v m=10 -v n=3 '
{ data[NR] = $1 }                          # collect one value per input line
END {
    for( i = 1; i<=m; i++ ) {
        for( j = 0; j<n; j++ ) {
            printf "%s ", data[j*m + i]    # value for output row i, column j
        }
        printf "\n";
    }
}' inputfile
Here is a fairly simple solution (in the example I've set n equal to 3; plug in the appropriate value for n):
awk -v n=3 '{ row = row $1 " "; if (NR % n == 0) { print row; row = "" } }' FILE
This works by reading records one line at a time and concatenating each line onto the preceding ones. Once n lines have been concatenated, it prints the result as a single output line, then starts over. This repeats until there are no lines left in the input.
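For example, with n=3 and the numbers 1 through 6 as input (one per line), the output is:
1 2 3
4 5 6
(each row carries a trailing separator). Note that this fills each output row with consecutive input values rather than reading down columns.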
You can use the command below:
paste - - - < input.txt
By default, the delimiter is a TAB; to change it, use:
paste - - - -d' ' < input.txt
I need to compare two versions of the same file. Both are tab-separated and have this form:
<filename1><tab><Marker11><tab><Marker12>...
<filename2><tab><Marker21><tab><Marker22><tab><Marker23>...
Each row has a different number of markers (between 1 and 10), and they all come from a small set of possible markers. So a file looks like this:
fileX<tab>Z<tab>M<tab>A
fileB<tab>Y
fileM<tab>M<tab>C<tab>B<tab>Y
What I need is:
Sort the file by rows
Sort the markers in each row so that they are in alphabetical order
So for the example above, the result would be
fileB<tab>Y
fileM<tab>B<tab>C<tab>M<tab>Y
fileX<tab>A<tab>M<tab>Z
It's easy to do #1 using sort but how do I do #2?
UPDATE: It's not a duplicate of this post, since my rows are of different lengths and I need each row's entries (those after the filename) sorted individually, i.e. the only column that keeps its position is the first one.
awk solution:
awk 'BEGIN{ FS=OFS="\t"; PROCINFO["sorted_in"]="#ind_str_asc" }
{ split($0,b,FS); delete b[1]; asort(b); r="";
for(i in b) r=(r!="")? r OFS b[i] : b[i]; a[$1] = r
}
END{ for(i in a) print i,a[i] }' file
The output:
fileB Y
fileM B C M Y
fileX A M Z
PROCINFO["sorted_in"]="#ind_str_asc" - sets the array traversal order
split($0,b,FS) - splits the line into array b on FS (the field separator)
asort(b) - sorts the marker values
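As an illustration with the fileX row from the example (gawk's asort sorts the values and re-indexes the array from 1):
# after split($0,b,FS) and delete b[1]:  b[2]="Z", b[3]="M", b[4]="A"
# after asort(b):                        b[1]="A", b[2]="M", b[3]="Z"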
All you need is:
awk '
{ for (i=2;i<=NF;i++) arr[$1][$i] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in arr) {
printf "%s", i
for (j in arr[i]) {
printf "%s%s, OFS, arr[i][j]
}
print ""
}
}
' file
The above uses GNU awk for true multi-dimensional arrays plus sorted_in.
I'm wondering if there is a way to select columns by matching the header.
The data looks like this:
ID_1 ID_2 ID_3 ID_6 ID_15
value1 0 2 4 7 6
value2 0 4 4 3 8
value3 2 2 3 7 8
I would like to get only the ID_3 & ID_15 columns:
ID_3 ID_15
4 6
4 8
3 8
awk can easily extract the columns if I know their positions.
However, I have a very huge table and only have a list of IDs in hand.
Can I still use awk, or is there an easier way in Linux?
The input format isn't well defined, but there are a few simple approaches: awk, perl and sqlite.
(FNR==1) {
    nocol=split(col,ocols,/,/)                      # col contains the named columns
    ncols=split("vals " $0,cols)                    # header line, plus a dummy first name
    for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn # map names to positions
    OFS="\t"                                        # to align output
    for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
    printf("\n")                                    # output header line
}
(FNR>1) {                                           # read data
    for (nn=1; nn<=nocol; nn++) {
        if (nn>1) printf(OFS)                       # pad
        if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
        else { printf "--" }                        # named column not in data
    }
    printf("\n")                                    # wrap line
}
$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3 ID_15
4 6
4 8
3 8
Perl, just a variation on the above with some perl idioms to confuse/entertain:
use strict;
use warnings;

our @ocols=split(/,/,$ENV{cols});   # cols env var contains the named columns
our $nocol=scalar(@ocols);
our ($nn,%colmap);
$,="\t";                            # OFS equivalent

# while (<>) {...} implicit with perl -an
if ($. == 1) {                      # FNR equivalent
    %colmap = map { $F[$_] => $_+1 } 0..$#F ;   # create name map hash
    $colmap{vals}=0;                # name anon 1st col
    print @ocols,"\n";              # output header
} else {
    for ($nn = 0; $nn < $nocol; $nn++) {
        print "\t" if ($nn>0);
        if (exists($colmap{$ocols[$nn]})) { printf("%s",$F[$colmap{$ocols[$nn]}]) }
        else { printf("--") }       # named column not in data
    }
    printf("\n")
}
$ cols="ID_3,ID_15" perl -an mycols.pl < data
That uses an environment variable to avoid the effort of parsing the command line. It needs the perl options -an, which set up field splitting and an input read loop (much like awk does).
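Roughly speaking, perl -an wraps the script body in something like the following (a sketch of what the switches do, not the exact generated code):
while (<>) {
    @F = split(' ', $_);    # -a: autosplit each input line into @F
    # ... script body runs here for every line ...
}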
And with sqlite (I used v3.11, v3.8 or later is required for useful .import I believe). This uses an in-memory temporary database (name a file if too large for memory, or for a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.
You can skip this next step if you have a single hard-tab delimiting the columns, in which case replace .mode csv with .mode tab in the sqlite example below.
Otherwise, to convert your data to a suitable CSV-ish format:
nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1;print}' < data > data.csv
This adds a dummy first column "vals" to the first line, then prints each line comma-separated. It does this with a seemingly pointless assignment to $1, which forces awk to rebuild $0, replacing the input separators (space/tab) with OFS (comma).
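A quick illustration of that rebuild trick on its own:
$ echo "a b   c" | awk -v OFS="," '{$1=$1; print}'
a,b,c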
$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
"vals" TEXT,
"ID_1" TEXT,
"ID_2" TEXT,
"ID_3" TEXT,
"ID_6" TEXT,
"ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3 ID_15
---------- ----------
4 6
4 8
3 8
Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required.
sqlite is quite happy to create an unnamed column, so you don't have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines and formats.
If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.
$ cat c.awk
NR == 1 {
for (i=1; i<=NF; ++i) {
if ($i == "ID_3") col_3 = (i + 1)
if ($i == "ID_15") col_15 = (i + 1)
}
print "ID_3", "ID_15"
}
NR > 1 { print $col_3, $col_15 }
$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8
You could go for something like this:
BEGIN {
keys["ID_3"]
keys["ID_15"]
}
NR == 1 {
for (i = 1; i <= NF; ++i)
if ($i in keys) cols[++n] = i
}
{
for (i = 1; i <= n; ++i)
printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS)
}
Save the script to a file and run it like awk -f script.awk file.
Alternatively, as a "one-liner":
awk 'BEGIN { keys["ID_3"]; keys["ID_15"] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file
Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.
On the first line, record all the column numbers that contain one of the keys in the cols array.
Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.
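To see why that offset is needed, line up the header and a data row from the example:
ID_1   ID_2   ID_3   ID_6   ID_15         <- ID_3 is field 3 here
value1 0      2      4      7      6      <- the ID_3 value (4) is field 3+1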
Try the script below:
#!/bin/sh
file="$1"; shift

awk -v cols="$*" '
BEGIN{
    n=split(cols,C)
    OFS=FS="\t"
    getline                            # read the header line
    split($0,H)
    for(c=1;c<=n;c++){
        for(h in H){
            if(C[c]==H[h])F[c]=h+1     # +1: data rows carry an extra leading label column
        }
    }
}
{ l="";for(f=1;f<=n;f++){l=l $F[f] OFS}print l }
' "$file"
On the command line, type:
[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_15
I have five different files. Part of each file looks like this:
ifile1.txt  ifile2.txt  ifile3.txt  ifile4.txt  ifile5.txt
2           3           2           3           2
1           2           /no value   2           3
/no value   2           4           3           /no value
3           1           0           0           1
/no value   /no value   /no value   /no value   /no value
I need to compute the average across these five files, ignoring missing values, i.e.
ofile.txt
2.4
2.0
3.0
1.0
99999
Here 2.4 = (2+3+2+3+2)/5
2.0 = (1+2+2+3)/4
3.0 = (2+4+3)/3
1.0 = (3+1+0+0+1)/5
99999 = all are missing
I was trying it the following way, but I don't feel it is a proper approach.
paste ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt > ofile.txt
tr '\n' ' ' < ofile.txt > ofile1.txt
awk '!/\//{sum += $1; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile2.txt
awk '!/\//{sum += $2; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile3.txt
awk '!/\//{sum += $3; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile4.txt
awk '!/\//{sum += $4; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile5.txt
awk '!/\//{sum += $5; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile6.txt
paste ofile2.txt ofile3.txt ofile4.txt ofile5.txt ofile6.txt > ofile7.txt
tr '\n' ' ' < ofile7.txt > ofile.txt
The following script.awk will deliver what you want:
BEGIN {
gap = -1;
maxidx = -1;
}
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
You call it with:
awk -f script.awk ifile*.txt
and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:
BEGIN {
gap = -1;
maxidx = -1;
}
This BEGIN section runs before any lines are processed; it initialises the current gap and the maximum index.
The gap is the difference between the overall line number NR and the file line number FNR, used to detect when you switch files, something that's very handy when processing multiple input files.
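For example, with two three-line files a and b, the counters run like this, so NR - FNR jumps from 0 to 3 exactly when the second file starts:
line read:  a1 a2 a3 b1 b2 b3
NR:          1  2  3  4  5  6
FNR:         1  2  3  1  2  3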
The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
The above code is the meat of the solution, executed per line. The first if statement is used to detect whether you've just moved into a new file and it does this simply so it can aggregate all the associated lines from each file. By that I mean the first line in each input file is used to calculate the average for the first line of the output file.
The second if statement adjusts maxidx if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.
The final if statement simply updates the sum and count if the line contains anything other than /no value.
And then, of course, the index is incremented for the next line.
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.
So, running that code over your input files gives, as requested:
$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
Using bash and numaverage (which ignores non-numeric input), plus paste, sed and tr (the latter two for cleaning, since numaverage needs single-column input and throws an error if the input is 100% text):
paste ifile* | while read x ; do \
numaverage <(tr '\t' '\n' <<< "$x") 2>&1 | \
sed -n '1{s/Emp.*/99999/;p}' ; \
done
Output:
2.4
2
3
1
99999
I want to split a large file (185 million records) into multiple files based on the value of one column. The file is a .dat file and the delimiter between columns is ^A (\u0001).
The File content is like this:
194^A1^A091502^APR^AKIMBERLY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A1^A091502^APR^AJOHN^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A^A091502^APR^AASHLEY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A3^A091502^APR^APETER^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A4^A091502^APR^AJOE^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
Now I want to split the file based on the value of the second column. As you can see in the third row, the second column is empty, so all rows with an empty second column should go into one file and the remaining rows into another.
Please help me with this. I tried to Google it; it seems awk should be used for this.
With awk:
awk -F '\x01' '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' filename
The file names can be chosen arbitrarily, of course. print > "file" prints the current record to a file named "file".
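If you ever needed one output file per distinct value of the second column (rather than just empty vs. non-empty), the same redirection idea extends; a sketch with made-up file names, noting that with many distinct values you would also want close() to avoid running out of open file descriptors:
awk -F '\x01' '{ out = ($2 == "" ? "empty" : "key_" $2) ".dat"; print > out }' filename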
Addendum re: comment: Removing the column is a little trickier but certainly feasible. I'd use
awk -F '\x01' 'BEGIN { OFS = FS } { fname = $2 == "" ? "empty.dat" : "normal.dat"; for(i = 2; i < NF; ++i) $i = $(i + 1); --NF; print > fname }' filename
This works as follows:
BEGIN { # output field separator is
OFS = FS # the same as input field
# separator, so that the
# rebuilt lines are formatted
# just like they came in
}
{
fname = $2 == "" ? "empty.dat" : "normal.dat" # choose file name
for(i = 2; i < NF; ++i) { # set all fields after the
$i = $(i + 1) # second back one position
}
--NF # let awk know the last field
# is not needed in the output
print > fname # then print to file.
}
I have a file with 10^7 lines, from which I want to choose 1/100 of the lines at random. This is the AWK code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach to do this?
awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {
for ( i=1;i<=c ;i++ ) {
num=int(rand() * c)
if ( a[num] ) {
print a[num]
delete a[num]
d++
}
if ( d == c/100 ) break
}
}' file
If you have that many lines, are you sure you want exactly 1%, or would a statistical approximation be enough?
In the second case, just sample each line with 1% probability:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
If you'd like the header line plus a random sample of lines after, use:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with perl (and without loading the entire file into memory):
cat your_file.txt | perl -n -e 'print if (rand() < .01)'
(simpler form, from comments):
perl -ne 'print if (rand() < .01)' your_file.txt
I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.
This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)
Original thread: Submitted for your review -- random sampling in awk.
# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2
BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}
NR <= n {
pool[NR] = $0
places[NR] = NR
next
}
NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}
END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
print newpool[ordered[i]]
}
function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}
This should work on most any GNU/Linux machine.
$ shuf -n $(( $(wc -l < $file) / 100)) $file
I'd be surprised if memory management was done inappropriately by the GNU shuf command.
I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the for line in file return line if rand < 0.01 approach, so it might be useful if you intend to do tasks like the above many (thousands, millions) of times. It is known as reservoir sampling and this page has a pretty good explanation of a version of it that is applicable to your situation.
The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)
A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.
In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk program for sampling:
#!/usr/bin/env awk -f
BEGIN{
srand();
if(k=="") k=10
}
NR <= k {
reservoir[NR-1] = $0;
next;
}
{ i = int(NR * rand()) }
i < k { reservoir[i] = $0 }
END {
for (i in reservoir) {
print reservoir[i];
}
}
If saved as sample_lines and made executable, it can be run like: ./sample_lines -v k=5 input_file. If k is not given, then 10 will be used by default.
Then figuring out what k is has to be done separately, for example by setting -v "k=$(dc -e "$(cat input_file | wc -l) 100 / n")"
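Equivalently, plain shell arithmetic avoids the dc invocation (a hypothetical wrapper around the same sample_lines script):
k=$(( $(wc -l < input_file) / 100 ))   # roughly 1% of the lines
./sample_lines -v k="$k" input_file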
You could do it in two passes:
Run through the file once, just to count how many lines there are
Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
Run through the file once more and pick out the lines at the selected positions
Example in python:
fn = '/usr/share/dict/words'
from random import randint
from sys import stdout
count = 0
with open(fn) as f:
for line in f:
count += 1
selected = set()
while len(selected) < count//100:
selected.add(randint(0, count-1))
index = 0
with open(fn) as f:
for line in f:
if index in selected:
stdout.write(line)
index += 1
Instead of waiting until the END block to randomly pick your 1% of lines, do it every 100 lines inside the main (!/^$/) block, as sketched below. That way, you only hold 100 lines at a time.
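A minimal sketch of that idea (a partial final batch of fewer than 100 lines is simply dropped here):
awk 'BEGIN { srand() }
!/^$/ {
    buf[n++] = $0                                       # hold at most 100 lines
    if (n == 100) { print buf[int(rand()*100)]; n = 0 } # pick one per batch
}' file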
If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. The number of lines in the file can be determined by doing two passes over the file: one to count the lines (as with wc -l), one to select the sample:
file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
BEGIN {srand(); p = int(n * percent)}
rand() * n-- < p {p--; print}' < "$file"
Here's my version. In the below 'c' is the number of lines to select from the input. Making c a parameter is left as an exercise for the reader, as is the reason the line starting with c/NR works to reliably select exactly c lines (assuming input has at least c lines).
#!/bin/sh
gawk '
BEGIN { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines) print lines[i] }
' "$#"