Compare two files with different column counts and print matching rows to a new file - linux

I have two files with more than 10000 rows:
File1 (1 col)    File2 (4 cols)
23               23 88 90 0
34               43 74 58 5
43               54 87 52 3
54               73 52 35 4
.                .
.                .
I want to compare each value in file-1 with the first column of file-2. If the value exists there, print it along with the other three values from that row of file-2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]; do
    n=$(awk 'NR=="$s1" {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]; do
        awk '{if ($1==$n) printf ("%s %s %s %s\n", $1, $2, $3, $4);}' > ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/easy way to do it?

You can do this very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
What it does:
FNR==NR checks whether FNR, the record number within the current file, equals NR, the total number of records read so far. These are equal only while awk reads the first file, file1, because FNR resets to 1 whenever awk starts a new file while NR keeps counting.
{found[$1]++; next} While the check is true, this builds an associative array found, indexed by $1 (the first column of file1), and then skips to the next record.
$1 in found This check runs only for the second file, file2. If the value in column 1, $1, is an index in the associative array found, the entire line is printed (the print is not written out because printing is awk's default action).
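Since the question asks for the matches in a new file, just redirect the output, using the file names from the question:
awk 'FNR==NR{found[$1]++; next} $1 in found' File1.txt File2.txt > ofile.txt
Unlike the nested shell loops, this reads each file exactly once, which is why it stays fast even with 10000+ rows.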

Related

How do I find the number of elements in a column greater than a given number in Linux?

I have a text file listing students with their marks, and I want to find how many of them scored more than 80 in Maths, in Physics, and then in Maths and Physics combined. What Linux command should I use?
The text file is here:
#name maths phy
Manila 78 29
Shikhar 49 78
Vandana 65 87
Priyansh 75 22
Bina 52 69
Chitransh 98 93
William 88 73
Kaushal 38 85
Dilruba 65 94
Lalremruata 34 45
Qasim 58 62
Nitya 81 89
Jennita 96 91
Shobha 71 63
Talim 77 88
This can be achieved using awk (don't use grep, since it is not suited to numeric comparisons). An example:
cat test.txt | awk '{if ($2>80 || $3>80) print $1 " " $2 " " $3}'
This needs improvement: how do you remove the cat command, how do you check the sum of both columns, why is the header line still present, ...? But at least it gives you something to start from.
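Since the question asks how many students qualify rather than which ones, here is a minimal counting sketch (assuming, like the answer below, that "combined" means the sum of the two marks exceeds the threshold, and skipping the # header line):
awk '/^[^#]/ {
    if ($2 > 80) m++
    if ($3 > 80) p++
    if ($2 + $3 > 80) c++
}
END { print m+0, p+0, c+0 }' test.txt
The +0 in the END block forces numeric output even when a counter was never incremented.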
Try this, and adapt it to your taste:
$ awk '/^[^#]/{
    limit = 80;
    comb = $2 + $3;
    if ($2 > limit && $3 > limit)
        print $1, $2, $3, "both";
    else if ($2 > limit)
        print $1, $2, $3, "maths";
    else if ($3 > limit)
        print $1, $2, $3, "physics";
    else if (comb > limit)
        print $1, $2, $3, "combined";
}' <<EOF
#name maths phy
Manila 78 29
Shikhar 49 78
Vandana 65 87
Priyansh 75 22
Bina 52 69
Chitransh 98 93
William 88 73
Kaushal 38 85
Dilruba 65 94
Lalremruata 34 45
Qasim 58 62
Nitya 81 89
Jennita 96 91
Shobha 71 63
Talim 77 88
EOF
which produces the following:
Manila 78 29 combined
Shikhar 49 78 combined
Vandana 65 87 physics
Priyansh 75 22 combined
Bina 52 69 combined
Chitransh 98 93 both
William 88 73 maths
Kaushal 38 85 physics
Dilruba 65 94 physics
Qasim 58 62 combined
Nitya 81 89 both
Jennita 96 91 both
Shobha 71 63 combined
Talim 77 88 physics
If you want the data to be read from a file, use it as:
$ awk '/^[^#]/{
    limit = 80;
    comb = $2 + $3;
    if ($2 > limit && $3 > limit)
        print $1, $2, $3, "both";
    else if ($2 > limit)
        print $1, $2, $3, "maths";
    else if ($3 > limit)
        print $1, $2, $3, "physics";
    else if (comb > limit)
        print $1, $2, $3, "combined";
}' marks.txt
the awk script
will read only the lines that do not have a # in the first column, which allows you to include comments, such as the header with the course names.
will let you easily configure the trigger level, as the limit variable is initialized in one place and used as a constant.
can be adapted to whatever criterion you want, even selecting the students who pass neither of the two courses.
Note:
If you want to do shell variable substitution inside the awk script, beware that awk uses the $<n> notation to refer to input fields, and the shell uses and expands the same notation for its own variables. If you need shell variable expansion, the best approach is to close the single quotes and open double quotes around only the variable to be expanded, so there is no confusion about which variables the shell expands and which awk sees. Example:
$ export TRIGGER=95
$ awk '/^[^#]/{
    limit = '"$TRIGGER"';
    comb = $2 + $3;
    if ($2 > limit && $3 > limit)
        print $1, $2, $3, "both";
    else if ($2 > limit)
        print $1, $2, $3, "maths";
    else if ($3 > limit)
        print $1, $2, $3, "physics";
    else if (comb > limit)
        print $1, $2, $3, "combined";
}'
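An alternative to splicing the quotes is awk's standard -v option, which assigns a shell value to an awk variable before the script runs; a sketch of the same filter:
$ awk -v limit="$TRIGGER" '/^[^#]/{
    if ($2 > limit && $3 > limit)  print $1, $2, $3, "both";
    else if ($2 > limit)           print $1, $2, $3, "maths";
    else if ($3 > limit)           print $1, $2, $3, "physics";
    else if ($2 + $3 > limit)      print $1, $2, $3, "combined";
}' marks.txt
This keeps the whole script inside single quotes, so no $ ever reaches the shell.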

How to print contents of column fields that have strings composed of "n" character/s using bash?

Say I have a file which contains:
22 30 31 3a 31 32 3a 32 " 0 9 : 1 2 : 2
30 32 30 20 32 32 3a 31 1 2 7 2 2 : 1
I want to print only the column fields that contain strings composed of a single character. I want the output to be like this:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
Then, I want to print only those strings composed of two characters; the output should be:
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
I am a beginner and I really don't know how to do this. Thanks for your help!
You could try the following; it takes a different approach and was written and tested only against the provided samples.
To get the values before the bulk space (the widest run of whitespace), try:
awk '
{
    line=$0
    # scan each whitespace run, remembering the widest one and where it starts
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH     # arr = widest run length seen so far
        start[arr]+=RSTART+prev_start   # track the offset of that run in the original line
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)    # drop the scanned part and continue
    }
    var=substr(line,1,start[arr]-1)     # keep the text before the widest gap
    sub(/ +$/,"",var)                   # trim trailing blanks
    print var
    delete start                        # reset per-line state
    var=arr=""
}
' Input_file
Output will be as follows.
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31
To get the values after the bulk space, try:
awk '
{
    line=$0
    # same scan as above: find the widest whitespace run
    while(match($0,/[[:space:]]+/)){
        arr=arr>RLENGTH?arr:RLENGTH     # arr = widest run length seen so far
        start[arr]+=RSTART+prev_start   # track the offset of that run in the original line
        prev_start=RSTART
        $0=substr($0,RSTART+RLENGTH)    # drop the scanned part and continue
    }
    var=substr(line,start[arr])         # keep the text after the widest gap
    sub(/^ +/,"",var)                   # trim leading blanks
    print var
    delete start                        # reset per-line state
    var=arr=""
}
' Input_file
Output will be as follows:
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
You can try
awk '{for(i=1;i<=NF;++i)if(length($i)==1)printf("%s ", $i);print("")}'
For each field, check its length and print it if it matches the desired length. You may pass the -F option to awk if the fields are not separated by blanks.
The awk script expands to:
for( i = 1; i <= NF; ++i )
    if( length( $i ) == 1 )
        printf( "%s ", $i );
print( "" );
The print outside the loop emits a newline after each input line.
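The same approach handles your second requirement; for the strings composed of two characters, change only the tested length:
awk '{for(i=1;i<=NF;++i)if(length($i)==2)printf("%s ", $i);print("")}'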
Assuming all the columns are tab-separated (so a column value can itself be a space, as in the second line of your sample), this is easy to do with a Perl one-liner:
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^.$/ } @F' foo.txt
" 0 9 : 1 2 : 2
1 2 7 2 2 : 1
$ perl -F"\t" -lane 'BEGIN { $, = "\t" } print grep { /^..$/ } #F' foo.txt
22 30 31 3a 31 32 3a 32
30 32 30 20 32 32 3a 31

bash for loops not looping (awk, bash, linux)

Here is a sample dataset (10 cols, 2 rows):
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
I would like to output ten files for each dataset. Each file will contain a unique value from the second row, and the filename will contain the value from the corresponding column in the first row.
(example: a file containing 0.001475, called foo_bar_8.1D)
See my code below, intended for use on the following datasets:
OrderTimesKC_voxel_tuning_1.txt
OrderTimesKC_voxel_tuning_2.txt
OrderTimesKC_voxel_tuning_3.txt
OrderTimesKC_voxel_tuning_4.txt
OrderTimesKC_voxel_tuning_5.txt
Script:
subj='KC'
for j in {1..5}; do
    for x in {1..10}; do
        a=$(awk 'FNR == 1 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt)  #a == row 1, column x
        b=$(awk 'FNR == 2 {print $"$x"}' OrderTimes"$subj"_voxel_tuning_"$j".txt)  #b == row 2, column x
        echo $b > voxTim_"$subj"_"$j"_"$a".1D
    done
done
The files currently produced are:
voxTim_KC_1_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_2_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_3_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_4_8?1?4?10?7?9?2?3?6?5.1D
voxTim_KC_5_8?1?4?10?7?9?2?3?6?5.1D
These contain ten values per file, indicating that the loop is not iterating correctly.
What I want is:
voxTim_KC_1_1.1D, voxTim_KC_1_2.1D, voxTim_KC_1_3.1D.....
voxTim_KC_2_1.1D, voxTim_KC_2_2.1D, voxTim_KC_2_3.1D.....
and so on..
Thank you!
awk to the rescue!
You can use awk more effectively; for example, this script extracts the two rows from each input file and creates 10 files (or however many columns there are) with the data
$ awk 'FNR==1{c++; n=split($0,r1); next}
FNR==2{split($0,r2);
for(i=1;i<=n;i++) print r2[i] > "file."c"."r1[i]".1D"}' input1 input2
This will create a set of files for the given input1 and input2 files. You can use it as a template and get rid of the shell for loops.
For example
$ tail -n 2 *
==> input1 <==
8 1 4 10 7 9 2 3 6 5
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
==> input2 <==
98 91 94 910 97 99 92 93 96 95
0.001475 10.001 20.25 30.5 40.75 51 61.25 71.5 81.75 92
after running the script
$ ls
file.1.1.1D file.1.2.1D file.1.4.1D file.1.6.1D file.1.8.1D file.2.91.1D file.2.92.1D file.2.94.1D file.2.96.1D file.2.98.1D input1
file.1.10.1D file.1.3.1D file.1.5.1D file.1.7.1D file.1.9.1D file.2.910.1D file.2.93.1D file.2.95.1D file.2.97.1D file.2.99.1D input2
and contents
$ tail -n 2 file.1*
==> file.1.1.1D <==
10.001
==> file.1.10.1D <==
30.5
==> file.1.2.1D <==
61.25
==> file.1.3.1D <==
71.5
==> file.1.4.1D <==
20.25
etc...
Actually, you can simplify it further to
$ awk 'FNR==1{c++; n=split($0,r1)}
FNR==2{for(i=1;i<=n;i++) print $i > ("file."c"."r1[i]".1D")}' input1 input2
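For what it is worth, the original loop fails because of quoting: inside single quotes the shell never expands $x, so awk evaluates $"$x", and since the numeric value of the string "$x" is 0, it prints $0, the whole row. That row, spaces and all, then ends up in the filename (hence the ? characters). If you do want to keep the loop structure, a minimal fix is to pass x in with awk's -v option:
a=$(awk -v x="$x" 'FNR == 1 {print $x}' OrderTimes"$subj"_voxel_tuning_"$j".txt)
b=$(awk -v x="$x" 'FNR == 2 {print $x}' OrderTimes"$subj"_voxel_tuning_"$j".txt)
The single-pass solutions above and below are still preferable, since they avoid re-reading each file many times.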
Just with bash:
subj=KC
for j in {1..5}; do
    {
        read -ra a   # read the 1st line into array 'a'
        read -ra b   # read the 2nd line into array 'b'
        for i in {0..9}; do
            echo "${b[i]}" > "voxTim_${subj}_${j}_${a[i]}.1D"
        done
    } < "OrderTimes${subj}_voxel_tuning_${j}.txt"
done

Compare two files and write the unmatched numbers in a new file

I have two files where ifile1.txt is a subset of ifile2.txt.
ifile1.txt    ifile2.txt
2             2
23            23
43            33
51            43
76            50
81            51
100           72
              76
              81
              89
              100
Desired output
ofile.txt
33
50
72
89
I was trying with
diff ifile1.txt ifile2.txt > ofile.txt
but it gives the output in a different format.
Since your files are sorted, you can use the comm command for this:
comm -1 -3 ifile1.txt ifile2.txt > ofile.txt
-1 means omit the lines unique to the first file, and -3 means omit the lines that appear in both files, so this shows just the lines unique to the second file. Note that comm expects its inputs sorted in lexicographic order, as produced by sort.
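If your data is not already in that order, a sketch using bash process substitution to sort the files on the fly (the results then come out in lexicographic order):
comm -13 <(sort ifile1.txt) <(sort ifile2.txt) > ofile.txt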
This will do most of the job:
diff file1 file2 | awk '{print $2}'
You could try the following, which also removes the empty lines that diff's hunk headers (lines such as 2a3) leave behind:
diff file1 file2 | awk '{print $2}' | grep -v '^$' > output.file
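Alternatively, the FNR==NR idiom from the first question above works here too; it needs no sorting and prints only the lines of ifile2.txt whose value never appears in ifile1.txt:
awk 'FNR==NR{seen[$1]; next} !($1 in seen)' ifile1.txt ifile2.txt > ofile.txt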

Cut a file between two line numbers using awk

Say I have a file with 100 lines (not including the header). I want to cut that file down, keeping only the content between lines 51 and 70 (inclusive), plus the header, so that the resulting file is 20+1 lines.
So far, I have this code:
awk 'NR==1 {h=$0; next} (NR-1)>50 && (NR-1)<71 {filename = "file20.csv"; print h >> filename} {print >> filename}' file100.csv
But it's giving me an error:
fatal: expression for `>>' redirection has null string value
Can somebody help me understand where my syntax is wrong?
The error comes from the final unconditional block {print >> filename}: it runs for every line after the header, but filename is only assigned once NR reaches 52, so until then the redirection target is the null string. You do not need that bookkeeping at all; you can directly use:
awk 'NR==1 || (NR>=51 && NR<=70)'
Note that this only evaluates a condition on NR. When it is true, awk performs its default action, {print $0}, so you do not have to spell it out. (If line 51 is meant to count from the first line after the header, shift the window to NR>=52 && NR<=71.)
Then you can redirect to another file:
awk 'NR==1 || (NR>=51 && NR<=70)' file > new_file
Test
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)'
1
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
It returns 21 lines:
$ seq 100 | awk 'NR==1 || (NR>=51 && NR<=70)' | wc -l
21
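On large files you can also stop reading as soon as the window has been printed; exit is standard awk, so this is a safe variant:
awk 'NR==1 || (NR>=51 && NR<=70); NR==70 {exit}' file > new_file
The rules run in order, so line 70 is printed before the exit fires.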
