Merge multiple columns from different files with a partial match via awk - linux

I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.
By partial match I mean something like this:
File A (column 1)
File B (column 2)
A=B?
A
A?
True
A
Asd
True
B
B
True
C
c
True
C
CA
True
D
A
False
If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;
File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e
File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D
Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
How can I do this with awk?

An awk script might work depending on the details of your requirements (e.g. of the keys are case sensitive or not).
An awk script like this might work:
function make_key(k)
{
# either return k or an uppercase version for case-insensitive keys
# return k;
return toupper(k);
}
BEGIN {
FS="\\";
}
NR==FNR {
key=make_key($2 "\\" $1);
if( key in notes){
notes[key]=notes[key] ";" $3
}
else {
notes[key]=$3
}
}
NR!=FNR {
for(k in notes){
pos=index(make_key($0),k);
if(pos==1){
printf "%s%s%s%s\n", $0, FS, notes[k], FS;
next;
}
}
print $0 FS;
}
You would use it like this:
awk -f script.awk file_A file_B
In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.
The NF==FNR block is used during reading the first file (file_A) here the notes are stored under a key made out of the second and first column. Notes are appended if the key is already existing.
In the NF!=FNR block the second file (file_B) is read. Here we print the line of file_b and the matching notes by comparing every key if the line of file_B starts with the key. If no key matches, then only the line of file_B is printed.
The BEGIN block just sets up the field separator.

To process the first file, we could use condition NR==FNR. During the first file processing, I used two hash tables, namely pattern_ht to store column 1 values in lower case and note_ht to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.
When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht values after converting to lower case. If it matches and this is the first match, record the third_col value. If it already has another value, append to it with ";". Finally, display accordingly:
awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next}
{third_col="";
for (key in pattern_ht) {
split(key,date_str,";");
if (date_str[1] == $1) {
if (tolower($2) ~ pattern_ht[key]) {
if (length(third_col) == 0)
third_col = note_ht[key]
else
third_col = third_col ";" note_ht[key]
}
}
}
if (length(third_col) != 0)
print $1"\\"$2"\\"third_col"\\"
else
print $1"\\"$2"\\"
}' fileA fileB
Here is the output:
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\

Related

How to count strings in specified field within each line of one or more csv files

Writing a Python program (ver. 3) to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
for pos in field:
pos = rowitem[pos] ##<---LINE generating error--->##
if pos not in fieldcnt:
fieldcnt[pos] = 1
else:
fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the csv file does not influence the count of the string. If the string uniqueness is case-insensitive please remember to use yourstring.lower() to return the string so that different case matches are actually counted as one. Also do keep in mind that if your text is large the number of unique strings you might find could be very large as well, so some sort of sorting must be in place to make sense of it! (Or else it might be a long list of random counts with a large portion of it being just 1s)
Now, to get a count of unique strings using the collections module is an easy way to go.
file = open('yourfile.txt', encoding="utf8")
a= file.read()
#if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media','omitted>','it\'s','two','said']))
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split(): #use the delimiter you want (a comma I think?)
# replace punctuation so they arent counted as part of a word
word = word.replace(".","")
word = word.replace(",","")
word = word.replace("\"","")
word = word.replace("!","")
if word not in stopwords:
if word not in wordcount:
wordcount[word] = 1
else:
wordcount[word] += 1
That should do it. The wordcount dict should contain the word and it's frequency. After that just sort it using collections and print it out.
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.

Compare multiple columns for each row

Using a csv file, i will like to compare multiple columns to check if all values are the same or not.
First row are the headers
First column is the label
The constant values should be from column 2 to the end ( can be 100 columns ) for the example i put only 8 columns.
The purpose is to check that all values are the same. and when it is not, report
input file
Number,V2 1563,V03-1555,V4 - 294,V-05 1580,V6-1561,V7-1562,V05-1601,V9-1587
Code,4.1.06,4.1.03,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06,4.1.06
Host Id,b90c27,b90c13,3.30E+65,b90c46,b90c21,b90c1f,b88a63,b90c49
SR,SR_2_MS,SR_2_MS,SR_4_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS,SR_2_MS
output desired
Bad code in V03-1555
Bad SR in V4 - 294
Appreciate your support
awk to the rescue!
I improvised little bit. How do we know which values are correct, which are not? Popular vote, counts the occurrences and assumes majority is right. As a side benefit, if all values are different as in your "Host Id" row, nothing is reported
$ awk -F, 'NR==1 {split($0,h); next}
{delete r;
for(i=2;i<=NF;i++) {r[$i]++; idx[$i]=i}
max=0;
for(k in r) if(max<r[k]) max=r[k];
if(length(r)>1)
for(k in r)
if(r[k]!=max)
print "Bad " $1 " in " h[idx[k]] " -> " k}' file
returns
Bad Code in V03-1555 -> 4.1.03
Bad SR in V4 - 294 -> SR_4_MS
you can remove the values printed, which I put for verification.

Manipulating undefined entries in a list

I am trying to manipulate a list (about 50 columns) where I basically want to select some columns (some 7 or 10). However, some of those columns have empty entries. I am guessing something like this is a minimal working example:
A B C D E#note these are 5 tab separated columns
this that semething something more the end
this.line is very incomplete #column E empty
but this is v.very complete
whereas this is not #column B empty
As you can see, the 3rd line is empty in the last position.
I want to find a way of efficiently replacing all empty fields of my interest by a string, say "NA".
Of course, I could do it in the following way, but it is not very elegant to do this for all the 10 columns that I have in my real data:
#!/usr/local/bin/perl
use strict;
use warnings;
open my $file,"<","$path\\file.txt"; #with correct path
my #selecteddata;my $blankE;my $blankB;
while (<$data>) {
chomp $_;
my #line= split "\t";
if (not defined $line[4]){
$blankE="NA";
} else {
$blankE=$line[4];
}
if (not defined $line[1]){
$blankB="NA";
} else {
$blankB=$line[1];
}
push #selecteddata,"$blankB[0]\t$line[1]\t$line[2]\t$line[3]$line[4]\n";
}
close $data;
Alternatively, I can pre-process the file and replace all undefined entries by "NA", but I would like to avoid this.
So the main question is this: is there a more elegant way to replace blank entries only in the columns that I am interested by some word?
Thank you!
The trick to not ignoring trailing tabs is to specify a negative LIMIT as the 4th argument to split (kudos ikegami).
map makes light work of setting the "NA" values:
while ( <$data> ) {
chomp;
my #fields = split /\t/, $_, -1;
#fields = map { length($_) ? $_ : 'NA' } #fields; # Transform #fields
my $updated = join("\t", #fields) . "\n";
push #selected_data, $updated ;
}
In one-liner mode:
$ perl -lne 'print join "\t", map { length ? $_ : "NA" } split /\t/, $_, -1' input > output
I would say that using split and join undoubtedly is the most clear, since you'll likely need to be doing that for other parsing as well. However, this could be solved using look around assertions as well
Basically, the boundary between elements will either be a tab or the end or beginning of a string, so if those conditions are true for both directions, then we have an empty field:
use strict;
use warnings;
while (<DATA>) {
s/(?:^|(?<=\t))(?=\t|$)/NA/g;
print;
}
__DATA__
a b c d e
a b c d e
a b d e
b c d e
a b
a b d
a e
Outputs:
a b c d e
a b c d e
a b NA d e
NA b c d e
a b NA NA NA
a b NA d NA
a NA NA NA e
Turning this into a one liner is trivial, but I will point out that this could be done using \K as well saving 2 characters: s/(?:\t|^)\K(?=\t|$)/NA/g;
I'm not sure if just using a sequence of substitutions looking for tabs that are either preceded/followed by spaces would catch everything but it's quick and easy if you have a lazy brain ;-)
perl -pne 's/\t\t/\tNA\t/;s/\t\s/\tNA/;s/^\t/NA\t/' col_data-undef.txt
I'm not sure if in a neater scriptish format it looks less or more yucky :-)
#!/usr/bin/env perl
# read_cols.pl - munge tab separated data with empty "cells"
use strict;
use warnings;
while (<>){
s/\t\t/\tNA\t/;
s/\t\s/\tNA/;
s/^\t/NA\t/;
print ;
}
Here's the output:
Here's vim buffers of the input and output with tabs as ^I in red ;-)
./read_cols.pl col_data-undef.txt > col_data-NA.txt
Is everything in the correct order? Would it work on 50 columns ?!?
Sometimes lazy is good but sometimes you need #ikegami ... :-)

AWK reporting duplicate lines and count, program explanation

I found the following AWK program on the internet and tweaked it slightly to look at column $2:
{ a[$2,NR]=$0; c[$2]++ }
END {
for( k in a ) {
split(k,b,SUBSEP)
t=c[b[1]] # added this bit to capture count
if( b[1] in c && t>1 ) { # added && t>1 only print if count more than 1
print RS "TIMES ID" RS c[b[1]] " " b[1] RS
delete c[b[1]]
}
for(i=1;i<=NR;i++) if( a[b[1],i] ) {
if(t>1){print a[b[1],i]} # added if(t>1) only print lines if count more than 1
delete a[b[1],i]
}
}
}
Given the following file:
abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2
The output is as follows when the command is run:
Command: awk -F, -f find_duplicates.awk duplicates
Output:
TIMES ID
2 2
abc,2,3
ghi,2,3
TIMES ID
2 3
def,3,4
mno,3,2
This is fine.
I would like to understand what is happening in the AWK program.
I understand that the first line is loading each line into a multidimentional array ?
So first line of file would be a['2','1']='abc,2,3' and so on.
However I'm a bit confised as to what c[$2]++ does, and also what is the significance of split(k,b,SUBSEP) ??
Would appreciate it if someone could explain line by line what is going on in this AWK program.
Thanks.
The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value for c[$2] and adds one to it. If $2 is a and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many of each $2 value you have seen.
for (k in a) loops over the keys of a. If the value of $2 on the first line was "a", the first value of k will be "a","1" (with 1 being the line number). The next time, it will be the combination of the value of $2 from the second line and the line number 2, etc.
The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:
END {
for (k in a) {
split (k,b,SUBSEP)
if (c[b[1]] > 1) {
if (! o[b[1]]) o[b[1]] = c[b[1]] " " b[1] RS
o[b[1]] = o[b[1]] RS a[k]
}
delete a[k]
}
for (q in o) print o[q] RS
}
Update: Removed the premature deletion of c[b[1]].

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However it does not work if delimiter is comma/semicolon. I tried to use dklmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter-- it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
% Initialize the variable output argument
varargout = cell(nargout, 1);
% Initialize elements of the cell array to nested cell arrays
% This syntax is due to {:} producing a comma-separated
[varargout{:}] = deal(cell());
fid = fopen(fname, 'r');
while true
% Get the current line
ln = fgetl(fid);
% Stop if EOF
if ln == -1
break;
endif
% Split the line string into components and parse numbers
elems = strsplit(ln, delim);
nums = str2double(elems);
nans = isnan(nums);
% Special case of all strings (header line)
if all(nans)
continue;
endif
% Find the indices of the NaNs
% (i.e. the indices of the strings in the original data)
idxnans = find(nans);
% Assign each corresponding element in the current line
% into the corresponding cell array of varargout
for i = 1:nargout
% Detect if the current index is a string or a num
if any(ismember(idxnans, i))
varargout{i}{end+1} = elems{i};
else
varargout{i}{end+1} = nums(i);
endif
endfor
endwhile
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.

Resources