Compare multiple columns for each row - linux

Using a CSV file, I would like to compare multiple columns to check whether all values in a row are the same.
The first row contains the headers.
The first column is the label.
The values to compare run from column 2 to the end (there can be 100 columns); for the example I included only 8 columns.
The goal is to check that all values are the same and, when they are not, to report it.
input file
Number,V2 1563,V03-1555,V4 - 294,V-05 1580,V6-1561,V7-1562,V05-1601,V9-1587
Code,4.1.06,4.1.03,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06
Host Id,b90c27,b90c13,3.30E+65,b90c46,b90c21,b90c1f,b88a63,b90c49
SR,SR_2_MS,SR_2_MS,SR_4_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS
output desired
Bad code in V03-1555
Bad SR in V4 - 294
Appreciate your support

awk to the rescue!
I improvised a little bit. How do we know which values are correct and which are not? By popular vote: count the occurrences and assume the majority is right. As a side benefit, if all values are different, as in your "Host Id" row, nothing is reported.
$ awk -F, 'NR==1 {split($0,h); next}
{delete r;
for(i=2;i<=NF;i++) {r[$i]++; idx[$i]=i}
max=0;
for(k in r) if(max<r[k]) max=r[k];
if(length(r)>1)
for(k in r)
if(r[k]!=max)
print "Bad " $1 " in " h[idx[k]] " -> " k}' file
returns
Bad Code in V03-1555 -> 4.1.03
Bad SR in V4 - 294 -> SR_4_MS
You can remove the printed values, which I added for verification.
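For comparison, here is a minimal Python sketch of the same majority-vote idea. The function name report_outliers and the SAMPLE constant are made up for illustration, and unlike the awk version it reports every cell that differs from the majority rather than one cell per distinct value.

```python
import csv
import io
from collections import Counter

# Sample data copied from the question.
SAMPLE = """Number,V2 1563,V03-1555,V4 - 294,V-05 1580,V6-1561,V7-1562,V05-1601,V9-1587
Code,4.1.06,4.1.03,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06
Host Id,b90c27,b90c13,3.30E+65,b90c46,b90c21,b90c1f,b88a63,b90c49
SR,SR_2_MS,SR_2_MS,SR_4_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS
"""

def report_outliers(csv_text):
    """Return one message per cell that differs from its row's most common value."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers = rows[0]
    messages = []
    for row in rows[1:]:
        label, values = row[0], row[1:]
        counts = Counter(values)
        top = max(counts.values())
        # If every value is unique (or several values tie for the top count),
        # no cell has a count below the maximum, so nothing is reported.
        for col, value in enumerate(values, start=1):
            if counts[value] != top:
                messages.append(f"Bad {label} in {headers[col]} -> {value}")
    return messages

for message in report_outliers(SAMPLE):
    print(message)
# Bad Code in V03-1555 -> 4.1.03
# Bad SR in V4 - 294 -> SR_4_MS
```

The csv module is used instead of a plain split so that quoted fields containing commas would still parse, which the awk -F, version does not handle.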

Related

Merge multiple columns from different files with a partial match via awk

I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.
By partial match I mean something like this:
File A (column 1)   File B (column 2)   A=B?
A                   A?                  True
A                   Asd                 True
B                   B                   True
C                   c                   True
C                   CA                  True
D                   A                   False
If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;
File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e
File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D
Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
How can I do this with awk?
An awk script might work, depending on the details of your requirements (e.g. whether the keys are case sensitive or not).
An awk script like this might work:
function make_key(k)
{
# either return k or an uppercase version for case-insensitive keys
# return k;
return toupper(k);
}
BEGIN {
FS="\\";
}
NR==FNR {
key=make_key($2 "\\" $1);
if( key in notes){
notes[key]=notes[key] ";" $3
}
else {
notes[key]=$3
}
}
NR!=FNR {
for(k in notes){
pos=index(make_key($0),k);
if(pos==1){
printf "%s%s%s%s\n", $0, FS, notes[k], FS;
next;
}
}
print $0 FS;
}
You would use it like this:
awk -f script.awk file_A file_B
In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.
The NR==FNR block runs while reading the first file (file_A). Here the notes are stored under a key made out of the second and first column. Notes are appended if the key already exists.
In the NR!=FNR block the second file (file_B) is read. Here we print the line of file_B and the matching notes by checking, for every key, whether the line of file_B starts with that key. If no key matches, only the line of file_B is printed.
The BEGIN block just sets up the field separator.
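The logic of this script can also be sketched in Python. This is only an illustration under the same assumptions as the awk version (a line of file B matches when it starts with "date\name" from file A, compared case-insensitively by default); the function name merge_notes and its parameters are hypothetical.

```python
from collections import defaultdict

def merge_notes(lines_a, lines_b, sep="\\", ignore_case=True):
    """Append the notes from file A to each matching line of file B."""
    normalize = str.upper if ignore_case else (lambda s: s)

    # First pass (the NR==FNR block): collect notes under the key "date<sep>name".
    notes = defaultdict(list)
    for line in lines_a:
        name, date, note = line.split(sep, 2)
        notes[normalize(date + sep + name)].append(note)

    # Second pass (the NR!=FNR block): the first key that prefixes the line wins.
    merged = []
    for line in lines_b:
        for key, found in notes.items():
            if normalize(line).startswith(key):
                merged.append(line + sep + ";".join(found) + sep)
                break
        else:
            # No key matched: print the line with a trailing separator only.
            merged.append(line + sep)
    return merged

file_a = ["A\\2022.10.10\\note a", "A\\2022.10.10\\note b", "B\\2022.10.14\\note c",
          "A\\2022.10.14\\note d", "C\\2022.10.15\\note e"]
file_b = ["2022.10.10\\A?", "2022.10.14\\B?", "2022.10.14\\a",
          "2022.10.15\\C", "2022.10.15\\D"]
print("\n".join(merge_notes(file_a, file_b)))
```

One difference worth knowing: Python dictionaries iterate in insertion order, while awk's for (k in notes) visits keys in an unspecified order, so if several keys could prefix the same line the two versions may pick different ones.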
To process the first file, we could use condition NR==FNR. During the first file processing, I used two hash tables, namely pattern_ht to store column 1 values in lower case and note_ht to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.
When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht values after converting to lower case. If it matches and this is the first match, record the third_col value. If it already has another value, append to it with ";". Finally, display accordingly:
awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next}
{third_col="";
for (key in pattern_ht) {
split(key,date_str,";");
if (date_str[1] == $1) {
if (tolower($2) ~ pattern_ht[key]) {
if (length(third_col) == 0)
third_col = note_ht[key]
else
third_col = third_col ";" note_ht[key]
}
}
}
if (length(third_col) != 0)
print $1"\\"$2"\\"third_col"\\"
else
print $1"\\"$2"\\"
}' fileA fileB
Here is the output:
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\

Print some columns in multiline string

Is there any way to print some columns in a string which is in several lines. For instance, let's suppose we have the following string:
EXAMPLE1
- -- ---
EXAMPLE2
I only want to print the columns that contain a '-'. So the output for this case should be:
EAMLE1
------
EAMLE2
I was thinking of splitting the string and iterating through every column using zip, printing just those columns which have a '-', but I don't really know how to use it properly.
Any idea would be welcomed
thanks in advance
Once we split the string into lines, we can use zip(*lines) to transpose the list, getting the columns, search those for -, and then transpose again to get the new lines. Then we can use str.join to assemble the result.
s = '''\
EXAMPLE1
- -- ---
EXAMPLE2'''
columns = (tup for tup in zip(*s.split('\n')) if any('-' in x for x in tup))
lines = (''.join(line) for line in zip(*columns))
print('\n'.join(lines))
Output:
EAMLE1
------
EAMLE2
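One caveat: zip stops at the shortest line, so columns beyond it are silently dropped. If the lines can differ in length, a variant along these lines might help. It is a sketch, not part of the answer above: the function name keep_marked_columns is made up, and padding short lines with spaces is an assumption about the desired behaviour.

```python
from itertools import zip_longest

def keep_marked_columns(text, marker="-"):
    """Keep only the character columns that contain `marker` in some line."""
    lines = text.split("\n")
    # Transpose lines into columns, padding short lines with spaces.
    columns = zip_longest(*lines, fillvalue=" ")
    kept = [column for column in columns if marker in column]
    # Transpose back into lines and reassemble the string.
    return "\n".join("".join(row) for row in zip(*kept))

print(keep_marked_columns("EXAMPLE1\n- -- ---\nEXAMPLE2"))
# EAMLE1
# ------
# EAMLE2
```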

How to get lines count in string?

On the whole, I get a string from JSON pair which contain "\n" symbols. For example,
"I can see how the earth nurtures its grass,\nSeparating fine grains from lumpy earth,\nPiercing itself with its own remains\nEnduring the crawling insects on its surface.\nI can see how like a green wave\nIt lifts the soil, swelling it up,\nAnd how the roots penetrate the surrounding mulch\nHappily inhaling the air in the sky.\nI can see how the light illuminates the flowers, -\nPouring itself into their tight buds!\nThe earth and the grass – continue to grow!\nDrowning the mountains in a sea of green...\nOh, The power of motion of the young,\nThe muscular pull of the plants!\nOpening up to the planet, the sun and to you,\nBreaking through the undergrowth to the fresh spring air!"
This string is a poetry for some picture.
Now I need to resize my display.newText object according to text length.
Here is how I see to do that:
Get number of lines (number of "\n" + 1, because there is no "\n" at the end)
In for loop get the longest line
Set display.newText object's size. May be using fontSize for calculating coefficient...
Question is: How to get number of lines?
To get the number of '\n' in a string, you can use string.gsub, it's used for string substitution, but it also returns the number of matches as the second return value.
local count = select(2, str:gsub('\n', '\n'))
or similar:
local _, count = str:gsub('\n', '\n')
This is apparently way faster than @Yu Hao's two solutions:
local function get_line_count(str)
local lines = 1
for i = 1, #str do
local c = str:sub(i, i)
if c == '\n' then lines = lines + 1 end
end
return lines
end
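The first two steps of the plan in the question (count the lines, then find the longest one) can be illustrated as follows. The sketch is in Python rather than Lua, purely to show the logic, and the function name line_stats is hypothetical.

```python
def line_stats(text):
    """Return (number of lines, longest line) for a newline-separated string."""
    lines = text.split("\n")
    # A string with n newline characters splits into n + 1 lines.
    longest = max(lines, key=len)
    return len(lines), longest

count, longest = line_stats("first line\nthe second, longer line\nend")
print(count, len(longest))  # 3 23
```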

Manipulating undefined entries in a list

I am trying to manipulate a list (about 50 columns) where I basically want to select some columns (some 7 or 10). However, some of those columns have empty entries. I am guessing something like this is a minimal working example:
A B C D E#note these are 5 tab separated columns
this that semething something more the end
this.line is very incomplete #column E empty
but this is v.very complete
whereas this is not #column B empty
As you can see, the 3rd line is empty in the last position.
I want to find a way of efficiently replacing all empty fields of my interest by a string, say "NA".
Of course, I could do it in the following way, but it is not very elegant to do this for all the 10 columns that I have in my real data:
#!/usr/local/bin/perl
use strict;
use warnings;
open my $data,"<","$path\\file.txt"; #with correct path
my @selecteddata;my $blankE;my $blankB;
while (<$data>) {
chomp $_;
my @line= split "\t";
if (not defined $line[4]){
$blankE="NA";
} else {
$blankE=$line[4];
}
if (not defined $line[1]){
$blankB="NA";
} else {
$blankB=$line[1];
}
push @selecteddata,"$line[0]\t$blankB\t$line[2]\t$line[3]\t$blankE\n";
}
close $data;
Alternatively, I can pre-process the file and replace all undefined entries by "NA", but I would like to avoid this.
So the main question is this: is there a more elegant way to replace blank entries only in the columns that I am interested by some word?
Thank you!
The trick to not ignoring trailing tabs is to specify a negative LIMIT as the 3rd argument to split (kudos ikegami).
map makes light work of setting the "NA" values:
while ( <$data> ) {
chomp;
my @fields = split /\t/, $_, -1;
@fields = map { length($_) ? $_ : 'NA' } @fields; # Transform @fields
my $updated = join("\t", @fields) . "\n";
push @selected_data, $updated ;
}
In one-liner mode:
$ perl -lne 'print join "\t", map { length ? $_ : "NA" } split /\t/, $_, -1' input > output
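Since the question asks about replacing blanks only in the columns of interest, here is a small Python sketch of the same split/map/join idea restricted to chosen column indexes. The function name fill_blanks and its parameters are made up for illustration. Python's str.split with an explicit separator keeps trailing empty fields, so no equivalent of the negative LIMIT is needed.

```python
def fill_blanks(line, columns=None, placeholder="NA"):
    """Replace empty tab-separated fields with `placeholder`.

    `columns` is an optional collection of 0-based indexes to restrict the
    replacement to; None means every column.
    """
    fields = line.rstrip("\n").split("\t")  # trailing empty fields are kept
    for index, value in enumerate(fields):
        if value == "" and (columns is None or index in columns):
            fields[index] = placeholder
    return "\t".join(fields)

print(fill_blanks("a\tb\t\td\t"))                 # every blank filled
print(fill_blanks("\tb\t\td\t", columns={0, 4}))  # only columns 0 and 4 filled
```

Lines that are shorter than expected (missing tabs altogether) are not padded by this sketch.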
I would say that using split and join is undoubtedly the clearest approach, since you'll likely need to do that for other parsing as well. However, this could also be solved using lookaround assertions.
Basically, the boundary between elements will either be a tab or the end or beginning of a string, so if those conditions are true for both directions, then we have an empty field:
use strict;
use warnings;
while (<DATA>) {
s/(?:^|(?<=\t))(?=\t|$)/NA/g;
print;
}
__DATA__
a b c d e
a b c d e
a b d e
b c d e
a b
a b d
a e
Outputs:
a b c d e
a b c d e
a b NA d e
NA b c d e
a b NA NA NA
a b NA d NA
a NA NA NA e
Turning this into a one liner is trivial, but I will point out that this could be done using \K as well saving 2 characters: s/(?:\t|^)\K(?=\t|$)/NA/g;
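The same lookaround pattern carries over to other regex engines. As a rough illustration, here it is with Python's re module, which supports lookbehind and lookahead but has no \K, so only the first form applies. The names EMPTY_FIELD and fill_empty_fields are invented for this sketch.

```python
import re

# An empty field sits between (start of line or a tab) and (a tab or end of line).
EMPTY_FIELD = re.compile(r"(?:^|(?<=\t))(?=\t|$)")

def fill_empty_fields(line, placeholder="NA"):
    """Insert `placeholder` into every empty tab-separated field of `line`."""
    return EMPTY_FIELD.sub(placeholder, line.rstrip("\n"))

for sample in ["a\tb\t\td\te", "\tb\tc\td\te", "a\tb\t\t\t", "a\t\t\t\te"]:
    print(fill_empty_fields(sample).replace("\t", " "))
# a b NA d e
# NA b c d e
# a b NA NA NA
# a NA NA NA e
```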
I'm not sure if just using a sequence of substitutions looking for tabs that are either preceded/followed by spaces would catch everything but it's quick and easy if you have a lazy brain ;-)
perl -pne 's/\t\t/\tNA\t/;s/\t\s/\tNA/;s/^\t/NA\t/' col_data-undef.txt
I'm not sure if in a neater scriptish format it looks less or more yucky :-)
#!/usr/bin/env perl
# read_cols.pl - munge tab separated data with empty "cells"
use strict;
use warnings;
while (<>){
s/\t\t/\tNA\t/;
s/\t\s/\tNA/;
s/^\t/NA\t/;
print ;
}
Run it like this:
./read_cols.pl col_data-undef.txt > col_data-NA.txt
Is everything in the correct order? Would it work on 50 columns ?!?
Sometimes lazy is good but sometimes you need @ikegami ... :-)

AWK reporting duplicate lines and count, program explanation

I found the following AWK program on the internet and tweaked it slightly to look at column $2:
{ a[$2,NR]=$0; c[$2]++ }
END {
for( k in a ) {
split(k,b,SUBSEP)
t=c[b[1]] # added this bit to capture count
if( b[1] in c && t>1 ) { # added && t>1 only print if count more than 1
print RS "TIMES ID" RS c[b[1]] " " b[1] RS
delete c[b[1]]
}
for(i=1;i<=NR;i++) if( a[b[1],i] ) {
if(t>1){print a[b[1],i]} # added if(t>1) only print lines if count more than 1
delete a[b[1],i]
}
}
}
Given the following file:
abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2
The output is as follows when the command is run:
Command: awk -F, -f find_duplicates.awk duplicates
Output:
TIMES ID
2 2
abc,2,3
ghi,2,3
TIMES ID
2 3
def,3,4
mno,3,2
This is fine.
I would like to understand what is happening in the AWK program.
I understand that the first line is loading each line into a multidimensional array?
So the first line of the file would be a['2','1']='abc,2,3' and so on.
However, I'm a bit confused as to what c[$2]++ does, and also what is the significance of split(k,b,SUBSEP)?
Would appreciate it if someone could explain line by line what is going on in this AWK program.
Thanks.
The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value for c[$2] and adds one to it. If $2 is a and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many of each $2 value you have seen.
for (k in a) loops over the keys of a, in no guaranteed order. If the value of $2 on the first line was "a", one value of k will be "a" and "1" joined by SUBSEP (with 1 being the line number). Another will be the combination of the value of $2 from the second line and the line number 2, etc.
The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:
END {
for (k in a) {
split (k,b,SUBSEP)
if (c[b[1]] > 1) {
if (! o[b[1]]) o[b[1]] = c[b[1]] " " b[1] RS
o[b[1]] = o[b[1]] RS a[k]
}
delete a[k]
}
for (q in o) print o[q] RS
}
Update: Removed the premature deletion of c[b[1]].
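To make the bookkeeping easier to follow, here is the same grouping idea as a short Python sketch: collect lines per ID in one pass, then keep only the IDs seen more than once. The function name duplicate_groups and its parameters are hypothetical, and column=1 corresponds to awk's $2 because Python indexes from 0.

```python
from collections import defaultdict

def duplicate_groups(lines, column=1, sep=","):
    """Map each value of `column` that occurs more than once to its lines."""
    groups = defaultdict(list)
    for line in lines:
        # One list per ID plays the role of both a[$2,NR] and c[$2].
        groups[line.split(sep)[column]].append(line)
    return {key: found for key, found in groups.items() if len(found) > 1}

data = ["abc,2,3", "def,3,4", "ghi,2,3", "jkl,5,9", "mno,3,2"]
for key, found in duplicate_groups(data).items():
    print("TIMES ID")
    print(len(found), key)
    print("\n".join(found))
    print()
```

Because each ID owns a list of its lines, the count is simply the length of that list, and no second scan over all line numbers is required.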
