Processing multiple files with different numbers of fields using awk [closed] - linux

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 3 years ago.
I have many files with data separated by spaces and newlines.
Each file contains a different number of parameter names, ending with the keyword "alter#", followed by the corresponding data values.
File #1:
encal cfreq trick
temp alter#
10 20 30
40 50
File #2:
encal tie trick
alter#
12 34 54
73
File #3:
encal tie trick temp
trip miles
alter#
12 34 54 56
73 34
5
I want my output file to combine all the data into one tab-separated file.
Filename encal cfreq trick tie temp trip miles alter
File1 10 20 30 NA 40 NA NA 50
File2 12 NA 54 34 NA NA NA 73
File3 12 NA 54 34 56 73 34 5
I tried to look at the code shown here: Process multiple file using awk
but my code got very verbose and I lost my way. Can someone help me here? I am not posting my code since I don't want to give a wrong start or waste people's time.
Thank you for your time in advance.
PS: The format for file1, file2, and file3 is correct. My software tool outputs files in exactly the format I have shown. I want my output file to be tab-separated.

$ cat tst.awk
BEGIN {
    RS = ""         # paragraph mode: each whole file is read as one record
    FS = "[#]"      # split each record at the "#" of "alter#"
    OFS = "\t"
}
FNR == 1 { numFiles++ }
{
    split($1,tags," ")      # $1 holds the parameter names, ending with "alter"
    split($2,vals," ")      # $2 holds the corresponding values
    for (i in tags) {
        tag = tags[i]
        val = vals[i]
        f[numFiles,tag] = val   # remember this file's value for this tag
        allTags[tag]            # referencing the element records the tag
    }
}
END {
    printf "File"
    for (tag in allTags) {      # header row: every tag seen in any file
        printf "%s%s", OFS, tag
    }
    print ""
    for (fileNr=1; fileNr<=numFiles; fileNr++) {
        printf "%s", ARGV[fileNr]
        for (tag in allTags) {  # "NA" for tags this file did not contain
            val = ( (fileNr,tag) in f ? f[fileNr,tag] : "NA" )
            printf "%s%s", OFS, val
        }
        print ""
    }
}
$ awk -f tst.awk file1 file2 file3
File trick temp miles alter tie cfreq encal trip
file1 30 40 NA 50 NA 20 10 NA
file2 54 NA NA 73 34 NA 12 NA
file3 54 56 34 5 34 NA 12 73
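Note that for (tag in allTags) visits keys in an unspecified order, which is why the columns above do not come out in the order the question asked for. If you use GNU awk, one way to get a deterministic (alphabetical) column order is to add a single gawk-specific line to the BEGIN block; a minimal sketch, assuming gawk:
BEGIN {
    RS = ""
    FS = "[#]"
    OFS = "\t"
    # gawk only: make every "for (tag in array)" loop iterate in sorted key order
    PROCINFO["sorted_in"] = "@ind_str_asc"
}
Reproducing the exact column order from the question would instead require remembering the tags in order of first appearance.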

Related

How to print contents of column fields that have strings composed of "n" character/s using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
And, I want to print only the column fields that hold strings composed of 1 character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings that are composed of two characters, the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
Could you please try the following; I am trying it in a different way for the provided samples. Written and tested with the provided samples only.
For getting the values before the BULK SPACE, try:
awk '
{
    line=$0
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH
        start[arr]+=RSTART+prev_start
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)
    }
    var=substr(line,1,start[arr]-1)
    sub(/ +$/,"",var)
    print var
    delete start
    var=arr=""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
For getting the values after the BULK SPACE, try:
awk '
{
    line=$0
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH
        start[arr]+=RSTART+prev_start
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)
    }
    var=substr(line,start[arr])
    sub(/^ +/,"",var)
    print var
    delete start
    var=arr=""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try:
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check the length and print it if it is the desired one. You may pass the -F option to awk if the input is not separated by blanks.
The awk script expands to:
for( i = 1; i <= NF; ++i )
    if( length( $i ) == 1 )
        printf( "%s ", $i );
print( "" );
The print outside the loop prints a newline after each input line.
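For example, with the sample input saved in a file (here called file, a name chosen just for illustration) and the default whitespace field splitting:
$ awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}' file
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Changing length($i)==1 to length($i)==2 selects the two-character fields instead.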
Assuming all the columns are tab-separated (so you can have a space as a column value, like the second line of your sample), this is easy to do with a perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } @F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

Datamash: Transposing the column into rows based on group in bash

I have a tab-delimited file with 2 columns like the following:
A 123
A 23
A 45
A 67
B 88
B 72
B 50
B 23
C 12
C 14
I want to transpose the above data based on the first column, like the following:
A 123 23 45 67
B 88 72 50 23
C 12 14
I tried datamash transpose < input-file.txt but it didn't yield the expected output.
One awk version:
awk '{printf ($1!=f?"\n%s":" "$2),$0;f=$1}' file
A 123 23 45 67
B 88 72 50 23
C 12 14
With this version you get one leading blank line, but it should be fast and handle large data, since no loops or arrays are used.
$1!=f?"\n%s":" "$2),$0 — if the first field is not equal to f, print a newline and the whole line;
if $1 == f, print only field 2.
f=$1 saves the first field for the next comparison.
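If the leading blank line is unwanted, a variant in the same spirit (a sketch, not taken from the answer) prints the newline before every group except the first; it also keeps the data out of the printf format string, which the one-liner above would misinterpret if a value contained a % character:
awk '$1!=f { printf "%s%s", (NR>1 ? "\n" : ""), $0; f=$1; next }
     { printf " %s", $2 }
     END { print "" }' file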
datamash --group=1 --field-separator=' ' collapse 2 <file | tr ',' ' '
Output:
A 123 23 45 67
B 88 72 50 23
C 12 14
Input must be sorted, as in the question.
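If the input is not already grouped on the first column, sorting it first should be enough; a sketch reusing the same datamash invocation:
sort -k1,1 file | datamash --group=1 --field-separator=' ' collapse 2 | tr ',' ' '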
This might work for you (GNU sed):
sed -E ':a;N;s/^((\S+)\s+.*)\n\2/\1/;ta;P;D' file
Append the next line, and if the first field of the first line is the same as the first field of the second line, remove the newline and the second line's first field. Otherwise print the first line in the pattern space, then delete it and the following newline, and repeat.

extract a list of data from multiple files

I would like to ask for help on this. Thank you very much!
I have thousands of files, each containing 5 columns and the first column containing names.
$ cat file1
name math eng hist sci
Kyle 56 45 68 97
Angela 88 86 59 30
June 48 87 85 98
I also have a file containing a list of names that can be found in the 5-column files.
$ cat list.txt
June
Isa
Angela
Manny
Specifically, I want to extract, say, the data in the 3rd column corresponding to my list file, in a structured way: columns representing the thousands of files and the names as rows. If a name in the list file is not present in a 5-column file, it should be shown as 0. Additionally, the columns should be headed with the filenames.
$ cat output.txt
names file1 file2 file3 file4
June 87 65 67 87
Isa 0 0 0 54
Angela 86 75 78 78
Manny 39 46 0 38
Using your test files list.txt and file1 (twice) for testing. First the awk:
$ cat program.awk
function isEmpty(arr, idx) { # using @EdMorton's test for array emptiness
    for (idx in arr)         # for figuring out the first data file
        return 0             # https://stackoverflow.com/a/20078022/4162356
    return 1
}
function process(n,a) {      # appending grades for the chosen ones
    if(!isEmpty(a)) {        # if a is not empty
        for(i in n)          # iterate thru all chosen ones
            n[i]=n[i] (n[i]==""?"":OFS) (i in a?a[i]:0) # and append
        delete a             # clear this file's grades before the next file
    }
}
FNR==1 {                     # for each new file
    h=h (h==""?"":OFS) FILENAME # build header
    process(n,a)             # and process the previous file in hash a
}
NR==FNR {                    # chosen ones to hash n
    n[$1]
    next
}
$1 in n {                    # add chosen ones to a
    a[$1]=$3
}
END {
    process(n,a)             # in the end
    print h                  # print the header
    for(i in n)              # and names with grades
        print i,n[i]
}
Running it:
$ awk -f program.awk list.txt file1 file1
list.txt file1 file1
Manny 0 0
Isa 0 0
Angela 86 86
June 87 87
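For the real data set with thousands of files, the same invocation should work with a glob instead of listing the files by hand; a sketch, assuming the data files share the prefix file:
$ awk -f program.awk list.txt file*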
$ cat awk-script
BEGIN{f_name="names"}        # save "names" to var f_name
NR==FNR{
    a[$1]=$1;b[$1];next      # fill arrays a & b, whose keys are the names from "list.txt"
}
FNR==1{                      # a new file is scanned
    f_name=f_name"\t"FILENAME; # append the FILENAME to f_name
    for(i in a){
        a[i]=b[i]==""?a[i]:a[i]"\t"b[i]; # flush the value of b[i], appending it to a[i]
        b[i]=0               # reset the value of b[i]
    }
}
{ if($1 in b){b[$1]=$3} }    # set $3 as the value of b[$1] if $1 exists in the keys of array b
END{
    print f_name;            # print the header line
    for(i in a){
        a[i]=b[i]==""?a[i]:a[i]"\t"b[i]; # flush the value of b[i] belonging to the last file
        print a[i]           # print a[i]
    }
}
Assuming more than one file (i.e., file1, file2, etc.) exists, you may use this command to get the result:
$ awk -f awk-script list.txt file*
names file1 file2
Manny 0 46
Isa 0 0
Angela 86 75
June 87 65

bash for loops not looping (awk, bash, linux)

Here is a sample dataset (10 cols, 2 rows):
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
I would like to output ten files for each dataset. Each file will contain a unique value from the second row, and the filename will contain the value from the corresponding column in the first row.
(example: a file containing .001475, called foo_bar_8.1D
See my code below, intended for use on the following datasets:
OrderTimesKC_voxel_tuning_1.txt
OrderTimesKC_voxel_tuning_2.txt
OrderTimesKC_voxel_tuning_3.txt
OrderTimesKC_voxel_tuning_4.txt
OrderTimesKC_voxel_tuning_5.txt
Script:
subj='KC'
for j in {1..5}; do
for x in {1..10}; do
a=$(awk 'FNR == 1 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt) #a == row 1, column x
b=$(awk 'FNR == 2 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt) #b == row 2, column x
echo $b > voxTim_"$subj"_"$j"_"$a".1D
done
done
The files currently being output are:
voxTim_KC_1_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_2_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_3_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_4_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_5_8?1?4?10?7?9?2?3?6?5.1D
These contain ten values per file, indicating that it is not looping correctly.
What I want is:
voxTim_KC_1_1.1D, voxTim_KC_1_2.1D, voxTim_KC_1_3.1D.....
voxTim_KC_2_1.1D, voxTim_KC_2_2.1D, voxTim_KC_2_3.1D.....
and so on..
Thank you!
awk to the rescue!
You can use awk more effectively; for example, this script extracts the two rows from each input file and creates 10 (or the actual number of columns) files with the data:
$ awk 'FNR==1{c++; n=split($0,r1); next}
FNR==2{split($0,r2);
for(i=1;i<=n;i++) print r2[i] > "file."c"."r1[i]".1D"}' input1 input2
This will create a set of files for the given input1 and input2 files. You can use it as a template and get rid of the for loops.
For example
$ tail -n 2 *
==> input1 <==
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
==> input2 <==
98 91 94 910 97 99 92 93 96 95
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
after running the script
$ ls
file.1.1.1D file.1.2.1D file.1.4.1D file.1.6.1D file.1.8.1D file.2.91.1D file.2.92.1D file.2.94.1D file.2.96.1D file.2.98.1D input1
file.1.10.1D file.1.3.1D file.1.5.1D file.1.7.1D file.1.9.1D file.2.910.1D file.2.93.1D file.2.95.1D file.2.97.1D file.2.99.1D input2
and contents
$ tail -n 2 file.1*
==> file.1.1.1D <==
10.001
==> file.1.10.1D <==
30.5
==> file.1.2.1D <==
61.25
==> file.1.3.1D <==
71.5
==> file.1.4.1D <==
20.25
etc...
Actually, you can simplify it further to:
$ awk 'FNR==1{c++; n=split($0,r1)}
FNR==2{for(i=1;i<=n;i++) print $i > ("file."c"."r1[i]".1D")}' input1 input2
Just with bash:
subj=KC
for j in {1..5}; do
{
read -ra a # read the 1st line into array 'a'
read -ra b # read the 2nd line into array 'b'
for i in {0..9}; do
echo "${b[i]}" > "voxTim_${subj}_${j}_${a[i]}.1D"
done
} < "OrderTimes${subj}_voxel_tuning_${j}.txt"
done
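As for why the original loop misbehaved: inside the single-quoted awk program the shell never expands $x, so awk sees the literal $"$x"; the string "$x" converts to the number 0, making the expression $0, the whole line, which is why every file received all ten values. If you would rather keep the loops, the usual fix is to pass the shell variable with awk's -v option; a sketch of just that change:
a=$(awk -v x="$x" 'FNR == 1 {print $x}' "OrderTimes${subj}_voxel_tuning_${j}.txt") #a == row 1, column x
b=$(awk -v x="$x" 'FNR == 2 {print $x}' "OrderTimes${subj}_voxel_tuning_${j}.txt") #b == row 2, column x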

Find lines with a common value in a particular column

Suppose I have a file like this
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
I want to output only the lines that contain repeated values in the 4th column. Therefore, my desired output would be:
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty
How can I perform this task?
I can identify and isolate the column of interest with:
awk -F"," '{print $4}' file1 > file1_temp
and then check if there are repeated values and how many with:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' file1_temp
but that's definitely not what I would like to do.
A simple way to preserve the ordering would be to run through the file twice. The first time, keep a record of the counts, then print the ones with a count greater than 1 on the second pass:
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
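Running that on the sample (saved here as file) keeps exactly the buu and ty lines, since those are the only values that appear more than once in the fourth column:
$ awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty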
If you prefer not to loop through the file twice, you can keep track of things in a few arrays and do the printing in the END block:
awk '{ line[NR] = $0; col[NR] = $4; ++count[$4] }
END { for (i = 1; i <= NR; ++i) if (count[col[i]] > 1) print line[i] }' file
Here line stores the contents of the whole line, col stores the fourth column and count does the same as before.
