I'm wondering if there is a way to extract columns by matching the header.
The data looks like this:
ID_1 ID_2 ID_3 ID_6 ID_15
value1 0 2 4 7 6
value2 0 4 4 3 8
value3 2 2 3 7 8
I would like to get only the columns ID_3 & ID_15:
ID_3 ID_15
4 6
4 8
3 8
awk can easily pick out the columns if I know their positions.
However, I have a very large table and only have a list of IDs in hand.
Can I still use awk, or is there an easier way in Linux?
The input format isn't well defined, but there are a few simple approaches: awk, perl and sqlite.
(FNR==1) {
    nocol=split(col,ocols,/,/)    # col contains the requested column names
    ncols=split("vals " $0,cols)  # header line, with a name added for the first column
    for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn  # map names to field numbers
    OFS="\t"                      # to align output
    for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
    printf("\n")                  # output header line
}
(FNR>1) {                         # read data
    for (nn=1; nn<=nocol; nn++) {
        if (nn>1) printf(OFS)     # pad
        if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
        else { printf "--" }      # named column not in data
    }
    printf("\n")                  # wrap line
}
$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3 ID_15
4 6
4 8
3 8
Perl, just a variation on the above with some perl idioms to confuse/entertain:
use strict;
use warnings;

our @ocols=split(/,/,$ENV{cols});  # cols contains named columns
our $nocol=scalar(@ocols);
our ($nn,%colmap);
$,="\t";                           # OFS equiv

# while (<>) {...} implicit with perl -an
if ($. == 1) {                     # FNR equiv
    %colmap = map { $F[$_] => $_+1 } 0..$#F ;  # create name map hash
    $colmap{vals}=0;               # name anon 1st col
    print @ocols,"\n";             # output header
} else {
    for ($nn = 0; $nn < $nocol; $nn++) {
        print "\t" if ($nn>0);
        if (exists($colmap{$ocols[$nn]})) { printf("%s",$F[$colmap{$ocols[$nn]}]) }
        else { printf("--") }      # named column not in data
    }
    printf("\n");
}
$ cols="ID_3,ID_15" perl -an mycols.pl < data
That uses an environment variable to avoid the effort of parsing the command line. It needs the perl options -an, which set up field splitting and an input read loop (much like awk does).
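For reference, -a and -n together wrap the script body in roughly this implicit loop (a sketch of what perlrun describes, not the literal generated code):
while (<>) {
    our @F = split(' ', $_);   # -a: autosplit each line into @F
    # ... the script body runs here, once per input line ...
}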
And with sqlite (I used v3.11; I believe v3.8 or later is required for a useful .import). This uses an in-memory temporary database (name a file if the data is too large for memory, or if you want a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.
You can skip this next step if you have a single hard tab delimiting the columns, in which case replace .mode csv with .mode tabs in the sqlite example below.
Otherwise, to convert your data to a suitable CSV-ish format:
nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1;print}' < data > data.csv
This adds a dummy first column "vals" to the header line, then prints each line comma-separated. It does this with a seemingly pointless assignment to $1, which causes $0 to be rebuilt with FS (space/tab) replaced by OFS (comma).
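As a quick illustration of the $1=$1 rebuild (a throwaway example, not part of the conversion above):
$ echo 'a   b  c' | awk -v OFS="," '{$1=$1; print}'
a,b,c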
$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
"vals" TEXT,
"ID_1" TEXT,
"ID_2" TEXT,
"ID_3" TEXT,
"ID_6" TEXT,
"ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3 ID_15
---------- ----------
4 6
4 8
3 8
Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required.
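For example, a short session that writes the next query's result, with headers, to a file (result.txt is just a made-up name here):
sqlite> .headers on
sqlite> .once result.txt
sqlite> select ID_3,ID_15 from mytable;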
sqlite is quite happy to create an unnamed column, so you don't have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines and formats.
If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.
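One way to spot the offending lines is to compare each line's field count against the header's, for example with awk (a rough check that assumes the comma-separated data.csv from above and no quoted commas inside fields):
awk -F',' 'NR==1 { n=NF } NF!=n { print FNR ": " NF " fields (expected " n ")" }' data.csv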
$ cat c.awk
NR == 1 {
    for (i=1; i<=NF; ++i) {
        if ($i == "ID_3") col_3 = (i + 1)
        if ($i == "ID_15") col_15 = (i + 1)
    }
    print "ID_3", "ID_15"
}
NR > 1 { print $col_3, $col_15 }
$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8
You could go for something like this:
BEGIN {
    keys["ID_3"]
    keys["ID_15"]
}
NR == 1 {
    for (i = 1; i <= NF; ++i)
        if ($i in keys) cols[++n] = i
}
{
    for (i = 1; i <= n; ++i)
        printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS)
}
Save the script to a file and run it like awk -f script.awk file.
Alternatively, as a "one-liner":
awk 'BEGIN { keys["ID_3"]; keys["ID_15"] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file
Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.
On the first line, record all the column numbers that contain one of the keys in the cols array.
Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.
Try the script below:
#!/bin/sh
file="$1"; shift
awk -v cols="$*" '
BEGIN{
    split(cols,C)
    OFS=FS="\t"
    getline
    split($0,H)
    for(c in C){
        for(h in H){
            if(C[c]==H[h])F[i++]=h
        }
    }
}
{ l="";for(f in F){l=l $F[f] OFS}print l }
' "$file"
On the command line, type:
[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_15
I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have the input file File.txt below:
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make a header like fruit_label, fruit_name, color, type, no. of seeds, color of seeds, and then print all the assigned values in rows. The data differs from fruit to fruit; for example, a banana doesn't have seeds, so its row value should be kept blank.
Can anyone help me here?
Another approach is a "Decorate & Process" approach. What is "Decorate & Process"? To Decorate is to take the text you have and decorate it with another separator to make field-splitting easier -- in your case, your fields can contain whitespace along with the ':' separator between the field-names and values, and with your inconsistent whitespace around ':' that makes it a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as the separator (a second call eliminates the '|' after the last field), and then a simpler awk process is used to split() the fields on ':' to obtain the field-name and field-value, where the field-value is simply printed and the field-names are stored in an array. When a duplicate field-name is found, it is used as a seen flag to designate the change between records, e.g.:
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
    for (i=1; i<=NF; i++) {
        if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
            if (! n || parts[1] in fldnames) {
                printf "%s %s", n ? "\n" : "", parts[2]
                delete fldnames
                n = 1
            }
            else
                printf " | %s", parts[2]
            fldnames[parts[1]]++
        }
    }
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see "Decorate-Sort-Undecorate" used to sort data on a new, non-existent column of values by "Decorating" your data with a new last field, sorting on that field, and then "Undecorating" to remove the additional field when the sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
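For illustration, a made-up Decorate-Sort-Undecorate pipeline that sorts a CSV by the sum of columns 2 and 3 (assuming three comma-separated columns, so the decoration lands in field 4):
# decorate: append the sum of columns 2 and 3 as an extra last field,
# sort numerically on that extra field, then undecorate by stripping it again
awk -F',' -v OFS=',' '{ print $0, $2+$3 }' file | sort -t',' -k4,4n | sed 's/,[^,]*$//'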
Here is my solution. Consider it a new year gift; usually you have to demonstrate what you have tried so far and we help you, rather than doing it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
    for (i = 0; i <= 6; ++i) {
        fruit[i] = ""
    }
}
# Print the content of the fruit
function array_print() {
    # Keep track of whether something was printed; if yes, print a newline.
    # This avoids printing a newline for an empty array.
    printedsomething = 0
    for (i = 0; i <= 6; ++i) {
        # Do not print if the content is empty
        if (fruit[i] != "") {
            printedsomething = 1
            if (i == 1) {
                # The first field must be further split, to remove "Fruit_name"
                # Split on the space
                split(fruit[i], temparr, / /)
                printf "%s", trim(temparr[1])
            }
            else {
                printf " | %s", trim(fruit[i])
            }
        }
    }
    if ( printedsomething == 1 ) {
        print ""
    }
}
BEGIN {
    FS = ":"
    print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
    array_init()
}
/Fruit_label/ {
    array_print()
    array_init()
    fruit[1] = $2
    fruit[2] = $3
}
/Color:/ {
    fruit[3] = $2
}
/Type/ {
    fruit[4] = $2
}
/No.of seeds/ {
    fruit[5] = $2
}
/Color of seeds/ {
    fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line, so the idea is to store fruit information into an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
Difficulty: since the first line contains two pieces of information about each fruit, and I cut the lines on the : character, the Fruit_label field will include "Fruit_name".
I.e.: the first line is cut like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
Trim functions are there to remove spaces.
For example, for the Apple, "Type: S" split on the : gives " S", which trim() reduces to "S".
So I have various .csv files in a directory of the same structure with first row as the header and first column as labels. Say file 1 is as below:
name,value1,value2,value3,value4,......
name1,100,200,0,0,...
name2,101,201,0,0,...
name3,102,202,0,0,...
name4,103,203,0,0,...
....
File2:
name,value1,value2,value3,value4,......
name1,1000,2000,0,0,...
name2,1001,2001,0,0,...
name3,1002,2002,0,0,...
name4,1003,2003,0,0,...
....
All the .csv files have the same structure with the same number of rows and columns.
What I want is something that looks like this:
name,value1,value2,value3,value4,......
name1,1100,2200,0,0,...
name2,1102,2202,0,0,...
name3,1104,2204,0,0,...
name4,1106,2206,0,0,...
....
Where all the value columns in the resulting file are the sums of the corresponding values in those columns across all the .csv files. So under value1 in the resulting file I should have 100+1000+... and so on.
The number of .csv files isn't fixed, so I think I'll need a loop.
How do I achieve this with a bash script on a Linux machine?
Thanks!
With AWK, try something like:
awk '
BEGIN { FS=OFS="," }
FNR==1 { header=$0 }            # header line
FNR>1 {
    sum[FNR,1] = $1             # name column
    for (j=2; j<=NF; j++) {
        sum[FNR,j] += $j
    }
}
END {
    print header
    for (i=2; i<=FNR; i++) {
        for (j=1; j<=NF; j++) {
            $j = sum[i,j]
        }
        print
    }
}' *.csv
It iterates over lines and columns accumulating the value into the simulated 2-dimensional array sum.
You do not have to explicitly loop over csv files. AWK automatically does it
for you.
After reading all csv files it reports the amounts for each line and column in the END block.
Note that gawk 4.0 and newer versions support true multi-dimensional arrays.
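If you do have gawk 4.0+ and prefer the array-of-arrays form, a rough (untested) sketch of the same summing script might look like this:
gawk '
BEGIN { FS=OFS="," }
FNR==1 { header=$0 }
FNR>1 {
    name[FNR] = $1
    for (j=2; j<=NF; j++) sum[FNR][j] += $j   # true two-dimensional subscript, gawk only
}
END {
    print header
    for (i=2; i<=FNR; i++) {
        line = name[i]
        for (j=2; j<=NF; j++) line = line OFS sum[i][j]
        print line
    }
}' *.csv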
Hope this helps.
EDIT
In order to calculate the average instead of sum, try:
awk '
BEGIN { FS=OFS="," }
FNR==1 { header=$0 }            # header line
FNR>1 {
    sum[FNR,1] = $1             # names column
    for (j=2; j<=NF; j++) {
        sum[FNR,j] += $j
    }
}
END {
    print header
    files = ARGC - 1            # number of csv files
    for (i=2; i<=FNR; i++) {
        $1 = sum[i,1]           # another treatment for the 1st column
        for (j=2; j<=NF; j++) {
            $j = sum[i,j] / files
            # if you want to specify the number of decimal places,
            # try something like:
            # $j = sprintf("%.2f", sum[i,j] / files)
        }
        print
    }
}' *.csv
Using Perl
/tmp> cat f1.csv
name,value1,value2,value3,value4
name1,100,200,0,0
name2,101,201,0,0
name3,102,202,0,0
name4,103,203,0,0
/tmp> cat f2.csv
name,value1,value2,value3,value4
name1,1000,2000,0,0
name2,1001,2001,0,0
name3,1002,2002,0,0
name4,1003,2003,0,0
/tmp>
/tmp> cat csv_add.ksh
perl -F, -lane '
    @FH=@F if $.==1;
    if($.>1) {
        if( $F[0] ~~ @names )
        {
            @t1=@{ $kv{$F[0]} };
            for($i=0;$i<=$#t1;$i++) { $t1[$i]+=$F[$i+1] }
            $kv{$F[0]}=[ @t1 ];
        }
        else {
            $kv{$F[0]}=[ @F[1..$#F] ];
            push(@names,$F[0]);
        }
    }
    END { print join(" ",@FH); for(@names) { print "$_,".join(",",@{$kv{$_}}) } }
    close(ARGV) if eof
' f1.csv f2.csv
/tmp>
/tmp> csv_add.ksh
name value1 value2 value3 value4
name1,1100,2200,0,0
name2,1102,2202,0,0
name3,1104,2204,0,0
name4,1106,2206,0,0
/tmp>
Sample input data:
Col1, Col2
120000,1261
120000,119879
120000,117737
120000,14051
200000,58411
200000,115292
300000,279892
120000,98572
250000,249598
120000,14051
......
I used Excel with the following steps:
Col3=Col2/Col1.
Format Col3 with percentage
Use countif to group by Col3
How can I do this task with awk or some other way on the Linux command line?
Expected result:
percent|count
0-20% | 10
21-50% | 5
51-100%| 10
I calculated the percentage, but I'm still trying to find a way to group by Col3:
cat input.txt |awk -F"," '$3=100*$2/$1'
awk approach:
awk 'BEGIN {
    FS=",";
    OFS="|";
}
(NR > 1){
    percent = 100 * $2 / $1;
    if (percent <= 20) {
        a["0-20%"] += 1;
    } else if (percent <= 50) {
        a["21-50%"] += 1;
    } else {
        a["51-100%"] += 1;
    }
}
END {
    print "percent", "count"
    for (i in a) {
        print i, a[i];
    }
}' data
Sample output:
percent|count
0-20%|3
21-50%|1
51-100%|6
A generic, self-documented version. It needs some fine tuning depending on the group names wanted in the result (whether the lower bound shows +1% or not, but that's not the real purpose).
awk -F ',' -v Step='0|20|50|100' '
BEGIN {
    # define the groups
    Gn = split( Step, aEdge, "|")
}
NR>1{
    # compute the percentage
    L = $2 * 100 / ($1>0 ? $1 : 1)
    # find which group it falls in
    for( j=1; ( L < aEdge[j] || L >= aEdge[j+1] ) && j < Gn;) j++
    # add to the group
    G[j]++
}
# print the result, ordered
END {
    print "percent|count"
    for( i=1;i<Gn;i++) printf( "%d-%d%%|%d\n", aEdge[i], aEdge[i+1], G[i])
}
' data
Another awk, with parametric bins and formatted output:
$ awk -F, -v OFS=\| -v bins='20,50,100' '
BEGIN {n=split(bins,b)}
NR>1 {for(i=1;i<=n;i++)
if($2/$1 <= b[i]/100)
{a[b[i]]++; next}}
END {print "percent","count";
b[0]=-1;
for(i=1;i<=n;i++)
printf "%-7s|%3s\n", b[i-1]+1"-"b[i]"%",a[b[i]]}' file
percent|count
0-20% | 3
21-50% | 1
51-100%| 6
Pure bash:
# arguments are histogram boundaries *in ascending order*
hist () {
    local lower=0$(printf '+(val*100>sum*%d)' "$@") val sum count n;
    set -- 0 "$@" 100;
    read -r
    printf '%7s|%5s\n' percent count;
    while IFS=, read -r sum val; do echo $((lower)); done |
        sort -n | uniq -c |
        while read count n; do
            printf '%2d-%3d%%|%5d\n' "${@:n+1:2}" $count;
        done
}
Example:
$ hist 20 50 < csv.dat
percent|count
0- 20%| 3
20- 50%| 1
50-100%| 6
Potential Issue: Does not print intervals with no values:
$ hist 20 25 45 50 < csv.dat
percent|count
0- 20%| 3
25- 45%| 1
50-100%| 6
Explanation:
lower is set to an arithmetic expression which counts how many of the given boundaries lie below 100*val/sum
The list of intervals is augmented with 0 and 100 so that the limits print correctly
The header line is ignored
The output header is printed
For each csv row, read the variables $sum and $val and send the numeric evaluation of $lower (which uses those variables) to...
count the number of instances of each interval count...
and print the interval and count
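For instance, with boundaries 20 and 50, lower becomes the string 0+(val*100>sum*20)+(val*100>sum*50); tracing two rows of the sample data:
# sum=120000, val=14051 : 0+(1405100>2400000)+(1405100>6000000) -> 0   (bucket  0- 20%)
# sum=120000, val=98572 : 0+(9857200>2400000)+(9857200>6000000) -> 2   (bucket 50-100%)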
Another, in GNU awk, using switch and regex to identify the values (since parsing was tagged in OP):
NR>1 {
    switch(p=$2/$1) {
    case /0\.[01][0-9]|\.20/:
        a["0-20%"]++;
        break;
    case /\.[2-4][0-9]|\.50/:
        a["21-50%"]++;
        break;
    default:
        a["51-100%"]++
    }
}
END { for(i in a) print i, a[i] }
Run it:
$ awk -F, -f program.awk file
21-50% 1
0-20% 3
51-100% 6
I want to split a large file (185 million records) into more than one file based on one column's value. The file is a .dat file and the delimiter used between the columns is ^A (\u0001).
The file content is like this:
194^A1^A091502^APR^AKIMBERLY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A1^A091502^APR^AJOHN^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A^A091502^APR^AASHLEY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A3^A091502^APR^APETER^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A4^A091502^APR^AJOE^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
Now I want to split the file based on the second column's value. If you look at the third row, the second column is empty, so all the rows with an empty second column should go into one file, and all the remaining rows into another file.
Please help me with this. I tried Googling; it seems awk is the tool for this.
Regards,
Shankar
With awk:
awk -F '\x01' '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' filename
The file names can be chosen arbitrarily, of course. print > "file" prints the current record to a file named "file".
Addendum re: comment: Removing the column is a little trickier but certainly feasible. I'd use
awk -F '\x01' 'BEGIN { OFS = FS } { fname = $2 == "" ? "empty.dat" : "normal.dat"; for(i = 2; i < NF; ++i) $i = $(i + 1); --NF; print > fname }' filename
This works as follows:
BEGIN {                    # output field separator is
    OFS = FS               # the same as input field
                           # separator, so that the
                           # rebuilt lines are formatted
                           # just like they came in
}
{
    fname = $2 == "" ? "empty.dat" : "normal.dat"  # choose file name
    for(i = 2; i < NF; ++i) {   # set all fields after the
        $i = $(i + 1)           # second back one position
    }
    --NF                        # let awk know the last field
                                # is not needed in the output
    print > fname               # then print to file.
}
Let's assume I have scientific data, all numbers arranged in a single column but representing an intensities matrix of n (width) by m (height). The column of the input file has in total n * m rows. An input example may look like that:
1
2
3
......
30
The new output should be such that I have n new columns with m rows. Sticking to my example with 30 fields input and n = 3, m = 10, I would need an output file like this (separator does not matter much, could be a blank, a tab etc.):
1 11 21
2 12 22
... ... ...
10 20 30
I use gawk under Windows. Please note that there is no special FS, and real-world examples are more like 60 * 60 or bigger.
If you are not limited to awk but have GNU core-utils (cygwin, native, ..) then the simplest solution is to use pr:
pr -ts" " --columns 3 file
I believe this will do:
awk '
{ split($0,data); }
END {
    m = 10;
    n = 3;
    for( i = 1; i<=m; i++ ) {
        for( j = 0; j<n; j++ ) {
            printf "%s ", data[j*m + i]   # output data plus space in one line
        }
        # here you might want to start a new line though you did not ask for it:
        printf "\n";
    }
}' inputfile
I might have the index counting wrong but I am sure you can figure it out. The trick is the split in the first line. It splits your input on whitespace and creates an array data. The END block runs after processing your file and just accesses data by index. Note array indices count from 0.
Assumption is all data is in a single line. Your question isn't quite clear on this. If it is on several lines you'd have to read it into the array differently.
Hope this gets you started.
EDIT
I notice you changed your question while I was answering it. So change
{ split($0,data); }
to
{ data[++i] = $1; }
to account for the input being on different lines. Actually, this would give you the option to read it into a two dimensional array in the first place.
EDIT 2
Read two dimensional array
To read as a two dimensional array assuming m and n are known beforehand and not encoded in the input somehow:
awk '
BEGIN {
    m = 10;
    n = 3;
}
{
    # place each input line at its row/column position (filling down the columns)
    i = (NR-1) % m;
    j = int((NR-1) / m);
    data[i,j] = $0;
}
END {
    # do something with data
}' inputfile
However, since you only want to reformat your data, you can combine the two solutions: collect the lines as you read them, print the matrix in the END block, and pass m and n on the command line:
awk -v m=10 -v n=3 '
{ data[NR] = $1 }
END {
    for( i = 1; i<=m; i++ ) {
        for( j = 0; j<n; j++ ) {
            printf "%s ", data[j*m + i]   # output data plus space in one line
        }
        printf "\n";
    }
}' inputfile
Here is a fairly simple solution (in the example I've set n equal to 3; plug in the appropriate value for n):
awk -v n=3 '{ row = row $1 " "; if (NR % n == 0) { print row; row = "" } }' FILE
This works by reading records one line at a time, concatenating each line with the preceding lines. When n lines have been concatenated, it prints the concatenated result on a single new line. This repeats until there are no more lines left in the input.
You can use the command below:
paste - - - < input.txt
By default, the delimiter is TAB; to change the delimiter, use the command below:
paste - - - -d' ' < input.txt