extract a list of data from multiple files - linux

I would like to ask for help on this. Thank you very much!
I have thousands of files, each containing 5 columns, with the first column containing names.
$ cat file1
name math eng hist sci
Kyle 56 45 68 97
Angela 88 86 59 30
June 48 87 85 98
I also have a file containing a list of names that can be found in the 5-column files.
$ cat list.txt
June
Isa
Angela
Manny
Specifically, I want to extract, say, the data in the 3rd column corresponding to the names in my list file, in a structured way: columns representing the thousands of files and the names as rows. If a name in the list file is not present in a 5-column file, it should be represented as 0. Additionally, columns should be headed with the filenames.
$ cat output.txt
names file1 file2 file3 file4
June 87 65 67 87
Isa 0 0 0 54
Angela 86 75 78 78
Manny 39 46 0 38

Using your test files list.txt and file1 (twice) for testing. First, the awk program:
$ cat program.awk
function isEmpty(arr, idx) {      # using @EdMorton's test for array emptiness
    for (idx in arr)              # for figuring out the first data file
        return 0                  # https://stackoverflow.com/a/20078022/4162356
    return 1
}
function process(n,a, i) {        # appending grades for the chosen ones
    if (!isEmpty(a)) {            # if a is not empty
        for (i in n)              # iterate thru all chosen ones
            n[i]=n[i] (n[i]==""?"":OFS) (i in a?a[i]:0)   # and append
        delete a                  # reset a for the next file
    }
}
FNR==1 {                          # for each new file
    h=h (h==""?"":OFS) FILENAME   # build header
    process(n,a)                  # and process the previous file in hash a
}
NR==FNR {                         # chosen ones to hash n
    n[$1]
    next
}
$1 in n {                         # add chosen ones to a
    a[$1]=$3
}
END {
    process(n,a)                  # in the end
    print h                       # print the header
    for(i in n)                   # and names with grades
        print i,n[i]
}
Running it:
$ awk -f program.awk list.txt file1 file1
list.txt file1 file1
Manny 0 0
Isa 0 0
Angela 86 86
June 87 87
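Note that for (i in n) visits the names in awk's internal hash order (hence Manny, Isa, Angela, June above), not the order of list.txt. If the row order matters, a minimal tweak, sketched here on the assumption that keeping a second array is acceptable, records the order while reading list.txt:

NR==FNR {                     # chosen ones to hash n, remembering their order
    n[$1]; order[++cnt]=$1
    next
}
END {
    process(n,a)
    print h
    for (j=1; j<=cnt; j++)    # rows in list.txt order
        print order[j], n[order[j]]
}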

$ cat awk-script
BEGIN{f_name="names"}   # save "names" to var f_name
NR==FNR{
    a[$1]=$1; b[$1]; next   # fill arrays a & b, whose keys are the names from "list.txt"
}
FNR==1{                         # a new file is being scanned
    f_name=f_name"\t"FILENAME   # append the FILENAME to f_name
    for(i in a){
        a[i]=(b[i]=="" ? a[i] : a[i]"\t"b[i])   # flush the value of b[i], appending it to a[i]
        b[i]=0                                  # reset the value of b[i]
    }
}
{ if($1 in b){b[$1]=$3} }   # set $3 as the value of b[$1] if $1 exists among the keys of array b
END{
    print f_name            # print the header
    for(i in a){
        a[i]=(b[i]=="" ? a[i] : a[i]"\t"b[i])   # flush the value of b[i] belonging to the last file
        print a[i]          # print the row
    }
}
Assuming more than one file (i.e., file1, file2, etc.) exists, you can run the script against all of them to get the result:
$ awk -f awk-script list.txt file*
names file1 file2
Manny 0 46
Isa 0 0
Angela 86 75
June 87 65

Related

Looping through list of IDs to count matches in two columns

This is going to be a complicated one to explain, so bear with me.
I am doing a blastp comparison of multiple proteins, all vs all, and want the number of shared proteins between the genomes.
I have a large file of query IDs and sequence IDs, for example:
A A 100
A A 100
A A 100
A B 74
A B 47
A B 67
A C 73
A C 84
A C 74
A D 48
A D 74
A D 74
B A 67
B A 83
B A 44
B B 100
The file continues like that. I'd like to count the number of occurrences of A in column 1 and B in column 2. I have found a way to do this with awk:
awk -F, '$1=="A" && $2=="B"' file | wc -l
However, I have hundreds of genomes, and this would involve typing the awk script thousands of times to get the different combinations. I added the IDs from column 1 to a text file and tried a loop through all the IDs for all possible combinations:
for i in $(cat ID.txt); do input_file=file.csv; awk -F, '$1==$i && $2==$i' ${input_file} | wc -l; done
This is the output:
0
0
0
0
0
0
0
etc.
I'd like the output to be:
A A 60
A B 54
A C 34
A D 35
etc.
Any help would be appreciated.
If I'm understanding correctly, then you can collect the count for each pair into an array, and then print out the array once complete:
awk -F, '{++a[$1 FS $2]} END{for(entry in a){print entry, a[entry]}}' file
A,A 3
B,A 3
A,B 3
B,B 1
A,C 3
A,D 3
This is doing the following:
Increment the count in array a for the item with the key constructed from the concatenation of the first two columns, separated by the field separator FS (comma): {++a[$1 FS $2]}
Once the file processing is done (END), loop through the array, calling each key entry: for (entry in a)
In the loop, print the key and the value: {print entry, a[entry]}
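As an aside, the original loop printed only zeros because the single quotes stop the shell from expanding $i, so awk sees a literal $i; with i uninitialized inside awk, $i means $0, and $1==$0 is false for any multi-field line. If you do want a per-ID loop, pass the shell variable in with awk's -v option, e.g. (using the ID.txt and file.csv from the question):

for i in $(cat ID.txt); do
    # id is handed to awk via -v, so no shell expansion is needed inside the script
    printf '%s %s %s\n' "$i" "$i" "$(awk -F, -v id="$i" '$1==id && $2==id' file.csv | wc -l)"
done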
…input… | WHINY_USERS=1 \  # Not trying to insult anyone -
                           # this is a special shell parameter
                           # recognized by mawk-1 to have array
                           # indices pre-sorted, somewhat similar to gawk's
                           #
                           #   PROCINFO["sorted_in"] = "@ind_str_asc"
mawk '{ __[$!--NF]-- } END { for (_ in __) { print _, -__[_] } }' OFS=',' FS='[, \t]+'
A,A,3
A,B,3
A,C,3
A,D,3
B,A,3
B,B,1
If there's a chance of more than 3 columns in the input, then do:
{m,n,g}awk '
BEGIN { _ += _ ^= FS = "[" (OFS = ",") " \t]+"
}     { __[$!(NF=_)]++
} END {
    for (_ in __) { print _, __[_] }
}'
Assigning to NF (the same record-rebuild trick as $1 = $1) takes care of placing the comma between columns 1 and 2 instead of having to do it manually.
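A quick illustration of that rebuild trick (the echoed line is made up; this relies on mawk/gawk rebuilding $0 with OFS when NF is assigned):

$ echo 'A  B   extra cols' | mawk 'BEGIN { _+=_^=FS="["(OFS=",")" \t]+" } { print $!(NF=_) }'
A,B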

Processing multiple files with different numbers of fields using awk [closed]

I have many files with different data separated by spaces and newlines.
Each file contains a different number of parameters, with the corresponding data after the keyword "alter#".
File #1:
encal cfreq trick
temp alter#
10 20 30
40 50
File #2:
encal tie trick
alter#
12 34 54
73
File #3:
encal tie trick temp
trip miles
alter#
12 34 54 56
73 34
5
I want my output file to combine all the data in one file as tab-separated data.
Filename encal cfreq trick tie temp trip miles alter
File1 10 20 30 NA 40 NA NA 50
File2 12 NA 54 34 NA NA NA 73
File3 12 NA 54 34 56 73 34 5
I tried to look at the code shown here: Process multiple file using awk,
but my code got very verbose and I lost my way. Can someone help me here? I am not posting my code since I don't want to give a wrong start or waste people's time.
Thank you for your time in advance.
PS: The format for file1, file2, and file3 is correct. My software tool outputs files in exactly the format I have shown. I want my output file to be tab-separated.
$ cat tst.awk
BEGIN {
    RS = ""
    FS = "[#]"
    OFS = "\t"
}
FNR == 1 { numFiles++ }
{
    split($1,tags," ")
    split($2,vals," ")
    for (i in tags) {
        tag = tags[i]
        val = vals[i]
        f[numFiles,tag] = val
        allTags[tag]
    }
}
END {
    printf "File"
    for (tag in allTags) {
        printf "%s%s", OFS, tag
    }
    print ""
    for (fileNr=1; fileNr<=numFiles; fileNr++) {
        printf "%s", ARGV[fileNr]
        for (tag in allTags) {
            val = ( (fileNr,tag) in f ? f[fileNr,tag] : "NA" )
            printf "%s%s", OFS, val
        }
        print ""
    }
}
$ awk -f tst.awk file1 file2 file3
File trick temp miles alter tie cfreq encal trip
file1 30 40 NA 50 NA 20 10 NA
file2 54 NA NA 73 34 NA 12 NA
file3 54 56 34 5 34 NA 12 73
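Note that the column order above is just awk's arbitrary array-traversal order. If sorted columns are preferred, a one-line addition to the BEGIN block does it, assuming GNU awk (this is a gawk-only feature):

BEGIN {
    RS = ""
    FS = "[#]"
    OFS = "\t"
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk: iterate arrays in sorted key order
}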

bash for loops not looping (awk, bash, linux)

Here is a sample dataset (10 cols, 2 rows):
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
I would like to output ten files for each dataset. Each file will contain a unique value from the second row, and the filename will contain the value from the corresponding column in the first row.
(example: a file containing 0.001475, called foo_bar_8.1D)
See my code below, intended for use on the following datasets:
OrderTimesKC_voxel_tuning_1.txt
OrderTimesKC_voxel_tuning_2.txt
OrderTimesKC_voxel_tuning_3.txt
OrderTimesKC_voxel_tuning_4.txt
OrderTimesKC_voxel_tuning_5.txt
Script:
subj='KC'
for j in {1..5}; do
    for x in {1..10}; do
        a=$(awk 'FNR == 1 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt)   # a == row 1, column x
        b=$(awk 'FNR == 2 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt)   # b == row 2, column x
        echo $b > voxTim_"$subj"_"$j"_"$a".1D
    done
done
The currently output files are:
voxTim_KC_1_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_2_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_3_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_4_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_5_8?1?4?10?7?9?2?3?6?5.1D
These contain ten values per file, indicating that it is not looping correctly.
What I want is:
voxTim_KC_1_1.1D, voxTim_KC_1_2.1D, voxTim_KC_1_3.1D.....
voxTim_KC_2_1.1D, voxTim_KC_2_2.1D, voxTim_KC_2_3.1D.....
and so on..
Thank you!
awk to the rescue!
You can use awk more effectively. For example, this script extracts the two rows from each input file and creates 10 files (or as many as there are columns) with the data:
$ awk 'FNR==1{c++; n=split($0,r1); next}
       FNR==2{split($0,r2);
              for(i=1;i<=n;i++) print r2[i] > ("file."c"."r1[i]".1D")}' input1 input2
This will create a set of files for the given input1 and input2 files. You can use it as a template and get rid of the for loops.
For example
$ tail -n 2 *
==> input1 <==
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
==> input2 <==
98 91 94 910 97 99 92 93 96 95
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
after running the script
$ ls
file.1.1.1D file.1.2.1D file.1.4.1D file.1.6.1D file.1.8.1D file.2.91.1D file.2.92.1D file.2.94.1D file.2.96.1D file.2.98.1D input1
file.1.10.1D file.1.3.1D file.1.5.1D file.1.7.1D file.1.9.1D file.2.910.1D file.2.93.1D file.2.95.1D file.2.97.1D file.2.99.1D input2
and contents
$ tail -n 2 file.1*
==> file.1.1.1D <==
10.001
==> file.1.10.1D <==
30.5
==> file.1.2.1D <==
61.25
==> file.1.3.1D <==
71.5
==> file.1.4.1D <==
20.25
etc...
Actually, you can simplify it further to
$ awk 'FNR==1{c++; n=split($0,r1)}
       FNR==2{for(i=1;i<=n;i++) print $i > ("file."c"."r1[i]".1D")}' input1 input2
(Parenthesizing the file-name expression after > keeps the concatenation unambiguous across awk implementations.)
Just with bash:
subj=KC
for j in {1..5}; do
    {
        read -ra a   # read the 1st line into array 'a'
        read -ra b   # read the 2nd line into array 'b'
        for i in {0..9}; do
            echo "${b[i]}" > "voxTim_${subj}_${j}_${a[i]}.1D"
        done
    } < "OrderTimes${subj}_voxel_tuning_${j}.txt"
done

Find lines with a common value in a particular column

Suppose I have a file like this
5 kata 45 buu
34 tuy 3 rre
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
21 plk 1 uio
23 kata 90 ty
I want to output only the lines that contain repeated values in the 4th column. Therefore, my desired output would be this one:
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty
How can I perform this task?
I can identify and isolate the column of interest with:
awk -F"," '{print $4}' file1 > file1_temp
and then check which values are repeated, and how many times, with:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' file1_temp
but that's definitely not what I would like to do.
A simple way to preserve the ordering would be to run through the file twice. The first time, keep a record of the counts, then print the ones with a count greater than 1 on the second pass:
awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
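Given the sample file above, this produces exactly the desired output:

$ awk 'NR == FNR { ++count[$4]; next } count[$4] > 1' file file
5 kata 45 buu
21 ppo 90 ty
21 ret 60 buu
09 ret 89 ty
23 kata 90 ty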
If you prefer not to loop through the file twice, you can keep track of things in a few arrays and do the printing in the END block:
awk '{ line[NR] = $0; col[NR] = $4; ++count[$4] }
END { for (i = 1; i <= NR; ++i) if (count[col[i]] > 1) print line[i] }' file
Here line stores the contents of the whole line, col stores the fourth column and count does the same as before.

Compare two files having different numbers of columns and print the matching rows to a new file if the condition is satisfied

I have two files with more than 10000 rows:
File1 (1 column):
23
34
43
54
.
.

File2 (4 columns):
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
.
.
I want to compare each value in file-1 with those in file-2. If it exists, then print the value along with the other three values from file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it takes too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do
    n=$(awk 'NR=="$s1" {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]
    do
        awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}' > ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What does it do?
FNR==NR checks whether FNR, the per-file record number, equals NR, the total record number. They are equal only while the first file, file1, is being read, because FNR is reset to 1 each time awk opens a new file.
{found[$1]++; next} If the check is true, the first column of file1 is stored as a key in the associative array found, and next skips the rest of the script for that line.
$1 in found This check is only reached for the second file, file2. If the column-1 value $1 is an index in the associative array found, the entire line is printed (the print action is not written because printing is the default action).
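As an aside (an alternative the original answer does not mention): when both files are sorted on the first column, as in this sample, the coreutils join command gives the same result:

$ join file1 file2
23 88 90 0
43 74 58 5
54 87 52 3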
